Social Network Analysis for Business Process Discovery
Cláudia Sofia da Costa Alves

Dissertation for the degree of Master in Information Systems and Computer Engineering

Supervisor: Prof. Doutor Diogo R. Ferreira
President: Prof. Alberto Manuel Rodrigues da Silva
Member: Prof. Miguel Leitão Bignolas Mira da Silva

July 2010
Acknowledgments

To my family, especially my parents, who have always supported me during my academic career. To Prof. Diogo Ferreira for his excellent assistance and availability to help. The support and guidance I received throughout this year greatly improved the value of this dissertation. To Álvaro Rebuge, a member of our research group, for the exchange of ideas and knowledge that were really helpful for the development of the case study. Last but not least, a special compliment is due to all my close friends, colleagues and all the others that marked my life course during or before this master degree.
Abstract

The organizational perspective of process mining is a valuable technique that allows discovering the social network of an organization. By doing so, it provides the means to evaluate networks by mapping and analyzing relationships among people, teams, departments or even entire organizations. However, when analyzing networks of large size, process mining techniques generate highly complex models, usually called spaghetti models, that may be confusing and difficult to understand. In this dissertation we present an approach that aims to overcome this difficulty by presenting the information in a way that can be easily read by users. Clustering techniques adopting a divide-and-conquer strategy are applied for this purpose, as they make it possible for the user to visualize and analyze the network at different levels of abstraction. Our approach also makes use of the concept of modularity, indicating which iteration of the clustering algorithm best represents the different user groups in the social network. This approach was implemented in the ProM framework and all the experiments were performed in that environment. Taking into consideration the results achieved for a real-world case study and the results of several experiments, we reached the conclusion that the approach is capable of dealing with complex logs and that the modularity concept provides a good hint of which group of clusters best represents the user groups in a social network.

Keywords: Process Mining, Social Network Analysis, ProM Framework, Clustering, Agglomerative Hierarchical Clustering, Organizational Modelling, Communities, Modularity
Resumo

The organizational perspective of process mining is an important technique that allows discovering the social network of an organization. This technique provides the means to evaluate social networks through the mapping and analysis of the relationships that exist among people, teams, departments or even entire organizations. However, when analyzing social networks of large dimensions, current techniques generate very complex models. With the goal of overcoming this difficulty, we present in this work an approach capable of representing large amounts of information in a simple way, so as to facilitate the analysis and understanding of the data. Clustering techniques can be used for this purpose, since they allow analyzing the network information at different levels of abstraction. Our approach adopts an Agglomerative Hierarchical Clustering algorithm. The concept of modularity was also adopted, with the goal of determining which iteration of the algorithm best represents the communities that exist in the network. The approach was implemented in the ProM tool. To demonstrate its application, a real case study was carried out, and taking into consideration the results obtained, we conclude that the approach is capable of dealing with complex logs and that the modularity concept indeed provides an idea of which group of communities best represents the social groups in the network.

Keywords: Process Mining, Social Network Analysis, ProM Framework, Clustering, Agglomerative Hierarchical Clustering, Organizational Modelling, Communities, Modularity
Contents

Acknowledgments
Abstract
Resumo

1 Introduction
  1.1 Process Mining
  1.2 Motivation
  1.3 Document Structure

2 Mining the Organizational Perspective
  2.1 Deriving social networks from event logs
  2.2 Techniques for Social Network Mining
    2.2.1 Social Network Miner
    2.2.2 Organizational Miner
    2.2.3 Role Hierarchy Miner
    2.2.4 Semantic Organizational Miner
    2.2.5 Staff Assignment Miner
  2.3 The ProM Framework
  2.4 Conclusion

3 Social Network Analysis
  3.1 Social Network Analysis (SNA)
  3.2 SNA Measures
    3.2.1 Measures for an individual level
    3.2.2 Measures for the network level
  3.3 Finding community structures in networks
    3.3.1 Traditional Approaches
    3.3.2 Recent Approaches
  3.4 Conclusion

4 Proposed Approach
  4.1 Motivation
  4.2 Proposal
    4.2.1 Application of Agglomerative Hierarchical Clustering in SNA
    4.2.2 Displaying social networks
  4.3 Conclusion

5 Implementation in ProM
  5.1 Extracting information from the Log file
  5.2 Agglomerative Hierarchical Clustering
  5.3 Modularity
    5.3.1 Definition of Modularity
    5.3.2 Working Together vs. Similar Tasks
  5.4 Conclusion

6 Case Study
  6.1 Similar Tasks
  6.2 Working Together
    6.2.1 First Approach - Who works with whom?
    6.2.2 Second Approach - Which specialties work together?
  6.3 Relationship to the Business Process
  6.4 Conclusion

7 Conclusions
  7.1 Main Contributions
  7.2 Future work

Bibliography

A Log File - insuranceclaimhandlingexample.mxml
B User Manual for the Social Network Mining Plug-in
List of Tables

2.1 Table representing the content of a fragment from an event log
Table representing the information in the insuranceclaimhandlingexample.mxml event log
Information extracted from the log file (insuranceclaimhandlingexample.mxml); this matrix shows the existing links among vertices
Adjacency matrix of insuranceclaimhandlingexample.mxml used to compute modularity
Originators' degree in the insuranceclaimhandlingexample.mxml social network
Table showing how many times each originator performs each task
Table showing how many tasks two originators perform in common
Characteristics of the three Hospital Log Files
List of Figures

1.1 Business Process Management life cycle showing the three phases where process mining is focused (dark blue circles)
Doing Similar Tasks as displayed in ProM
Hierarchical clustering result represented as a dendrogram (snapshot from ProM v5.2)
Organization model derived from the dendrogram; ovals and pentagons represent actors/originators and organizational entities respectively (snapshot from ProM v5.2)
Overview of the ProM Framework (adapted from [27])
MXML format (adapted from [8])
MXML snapshot
Network with community structure; there are three communities (represented by the dashed circles) composed of densely connected vertices, while links of lower density (depicted with thinner lines) establish the connections between the different communities
Comparison of the different phases supported by ProM and other software packages during a social network analysis
Output of ProM 5.2 using the Working Together mining tool applied on a small network (DecisionMinerLog.xml, supplied by ProM)
Output of ProM 5.2 using the Working Together mining tool applied on a large network (outpatientclinicexample.mxml, supplied by ProM 5.2); the mining result image is just a tiny part of the real network
Social network of insuranceclaimhandlingexample.mxml; this screenshot corresponds to the 1st iteration of the Working Together AHC algorithm, using tie break with modularity, where each cluster corresponds to a single originator
Matrix showing the relationships among originators of the social network depicted in the preceding figure
Social network of insuranceclaimhandlingexample.mxml; this screenshot corresponds to the 3rd iteration of the Working Together AHC algorithm, using tie break with modularity, where originators from the same cluster are represented by the same colour
Social network of insuranceclaimhandlingexample.mxml; this screenshot represents the organization units at the 3rd iteration of the Working Together AHC algorithm, using tie break with modularity
Modularity Chart
Similar Tasks Algorithm - Social Perspective
Similar Tasks Algorithm - Organizational Perspective
Similar Tasks - Modularity Best Case
Similar Tasks - Modularity Worst Case
Social network of the event log with 12 days; output of the iteration with the highest modularity of Average Linkage with tie break
Matrix from the 12-day log showing relationships among nurses
Social network of the event log with 12 days; output of the iteration with the highest modularity of Complete Linkage with tie break (GREEN = Emergency, BLUE = Pediatrics, PINK = Obstetrics/Gynecology, RED = Orthopedics, ORANGE = Emergency relay, DARK PURPLE = General surgery, LIGHT PURPLE = Neurology, BROWN = Internal Medicine)
Social network of the event log with 14 days; output of the iteration with the highest modularity of Single Linkage
Social network of the event log with 14 days; output of the iteration with the highest modularity of Complete Linkage
Emergency Department Business Process
Matrix view of the sub-network of Organization Unit 0, from the iteration with the highest modularity of Average Linkage with tie break, from the 12-day event log
Graph view of the sub-network of Organization Unit 0, from the iteration with the highest modularity of Average Linkage with tie break, from the 12-day event log
Chapter 1

Introduction

Nowadays we live in a very competitive market, where customers' needs and expectations are always changing. Industry requirements are also changing, and many mergers and acquisitions are taking place. All these permanent changes are a challenge for organizations. To gain a competitive advantage, organizations must revise, change and improve their strategic business processes in a fast and efficient way, in order not to lose market share. To optimize a business process, organizations must understand how the process is being performed, which usually involves a long period of analysis, including interviews with all the persons responsible for a given part of the process. The use and proliferation of Process-Aware Information Systems [2] (such as ERP, WfM, CRM and SCM systems) has led the way to a more efficient type of method to study the execution of processes, called process mining [17]. These systems typically record the events carried out during a business process execution in event logs, and the analysis of those logs can yield important knowledge to improve the quality of the organization's services and processes. Here is where process mining comes in. In the next sections we will explain how the process mining concept appeared and what its purpose is.

1.1 Process Mining

Business Process Management (BPM) systems are an effort to help organizations manage the process changes that are required in many areas of the business [3]. These systems have been widely used and are the best methodology so far. Ideally, they should provide support for the complete BPM life-cycle (Fig. 1.1): (re)design, modelling, execution, monitoring, and analysis of processes. However, existing BPM tools are unable to support the full life-cycle. These tools provide strong support in the design, configuration and execution phases. Nevertheless, the process monitoring, analysis and redesign phases receive limited support [18]. One reason for this lies in the fact that the analysis phase is focused on process performance, the major goal being to identify weaknesses. Unfortunately, this phase is limited to simple performance indicators, such as flow time.
Figure 1.1: Business Process Management life cycle showing the three phases where process mining is focused (dark blue circles)

For a further analysis, i.e., identifying structures or patterns in processes and organizations, BPM systems require human intervention, because these systems are not able to highlight weaknesses automatically, much less suggest improvements [18]. Therefore, the re-design phase is affected, because it has no information with which to suggest alternatives for the design phase. Besides this problem, there is no interoperability between some of the phases, i.e., some of the results generated by one phase cannot be used as input by the next phase of the life cycle, and human intervention is required to interpret, map and re-introduce the information in the correct format in the next phase [18]. Process mining plays a very important role in trying to fill these gaps by supporting the life cycle phases with event log information. Providing a bottom-up approach, process mining techniques can be used to support the redesign and diagnosis phases by analyzing the processes as they are being executed. Process mining requires the availability of an event log. In effect, event logs are widely available today. They may originate from all kinds of systems, ranging from enterprise information systems to embedded systems. Process mining is a very wide area, as it can be applied in fields such as hospitals, banks, embedded systems in cars, copiers, and sensor networks [17, 18, 21].

Process Mining Perspectives

Process mining research can be focused on many fields/perspectives, but three of them deserve special emphasis: (1) the process perspective ("How?"), (2) the organizational perspective ("Who?") and (3) the case perspective ("What?") [17, 21, 27]. An explanation of each one follows.
1. The process perspective focuses on the control-flow, i.e., the ordering of activities. The goal here is to find a good characterization of all the possible paths, e.g., expressed in terms of a Petri net.

2. The organizational perspective focuses on the resources, i.e., which performers are involved in the process model and the way they are related. The main goals are: to structure the organization by classifying people in terms of roles and organizational units; and to show relationships among performers.

3. The case perspective focuses on properties of cases. Cases can be characterized by their paths in the process or by the values of the corresponding data elements, e.g., if a case represents a supply order it is interesting to know the number of products ordered.

In each of the above perspectives, there are three orthogonal aspects: (1) discovery, i.e., generating a new model based on event log information; (2) conformance checking, i.e., exposing the differences between some a-priori model and a real process model constructed from an event log; and (3) extension, i.e., enriching and extending an a-priori model with new aspects and perspectives from an event log [23]. Therefore, all research in process mining can be classified according to two dimensions: the type of mining and the perspective. This dissertation focuses on the discovery aspect of the organizational perspective.

1.2 Motivation

Several process mining analysis tools are available in the market, although only a few of them support all process mining perspectives. After analyzing some of the available tools, we have come to the conclusion that ProM 1 (an extensible framework for process mining) is one of the most complete. Although ProM is a powerful tool, when analyzing the organizational perspective of networks of huge dimensions we are faced with some challenges, for basically two reasons: 1) the deficient representation of data by ProM, since this framework uses a very rudimentary tool to graphically represent huge amounts of data, making it very challenging for the user to analyze and explore the graphs that represent the network; 2) ProM is only able to map relationships between two individuals; it cannot map relationships among communities, teams or groups. Therefore, the main goal of this dissertation is to develop a new technique capable of identifying communities in networks, i.e., sub-groups of the network in which internal connections are dense and external connections are sparser. Furthermore, we want to provide this divide-and-conquer approach with advanced visualization techniques that can show a progressive formation of the communities. To do so, we will implement an Agglomerative Hierarchical Clustering (AHC) algorithm that will not only help us identify communities inside the network, but will also help us simplify the representation and visualization of the large amount of data required in this kind of analysis.

1 For more information visit
In this proposal we have also adopted a new concept - modularity - which is a quality measure able to identify which group of clusters is the best and closest to reality. After developing this new technique, it will be implemented as a plug-in in ProM v6. The motivation and goals of our proposal will be further explained in Chapter 4 and Chapter 5.

1.3 Document Structure

This document is organized as follows: Chapter 2 focuses on the mining of the organizational perspective. We broadly explain the whole process, from the extraction of information from event logs to the use of that information to build meaningful sociograms. This chapter also introduces techniques developed for social network analysis. Finally, we introduce the ProM framework (the framework where we have implemented our proposed technique) and the standard format of the event logs used in this framework. Chapter 3 introduces concepts from Social Network Analysis (SNA), such as the metrics used to analyze social networks and the most well-known algorithms to find communities in networks. The content of this chapter and Chapter 1 will be used as background information, needed to understand the following chapters. Chapter 4 presents a brief comparison between ProM and other existing software for social network analysis. This comparison leads us to the motivation of our work. After pointing out the challenges, we present the main goals of our proposal. Chapter 5 describes our plug-in and its implementation. We first explain how and which information is extracted from the log file to generate the input of our plug-in. Then we explain how the input is treated along the different stages of the plug-in. Chapter 6 demonstrates the approach in a real-world case study whose goal was to validate the plug-in. In this chapter we also show and explain some features and outcomes of our plug-in. Finally, in Chapter 7 we draw conclusions about this dissertation and suggest some future work. This dissertation has two appendices: Appendix A consists of an event log used as an example, which helps to explain how our technique was implemented in Chapter 5. Appendix B consists of a user manual, as an effort to better present our plug-in - the Organizational Miner Cluster plug-in.
Chapter 2

Mining the Organizational Perspective

The goal of process mining is to extract useful information from event logs that record the activities an organization performs. As described in the previous chapter, process mining can extract information according to three different perspectives:

1. The process perspective focuses on the control-flow, i.e., the ordering of activities. The goal here is to find a good characterization of all the possible paths, e.g., expressed in terms of a Petri net.

2. The organizational perspective focuses on the resources, i.e., which performers are involved in the process model and the way they are related. The main goals are: to structure the organization by classifying people in terms of roles and organizational units; and to show relationships among performers.

3. The case perspective focuses on properties of cases. Cases can be characterized by their paths in the process or by the values of the corresponding data elements, e.g., if a case represents a supply order it is interesting to know the number of products ordered.

In this chapter we focus on the main topic of this dissertation: the organizational perspective, and more precisely on the mining of this perspective. Mining is the method for distilling a process description from a set of real executions (stored in event logs). We focus only on the descriptions extracted from event logs that are helpful and valuable for the organizational perspective. We will start by explaining where process mining extracts the information to derive social networks from, and then we will explain which information is used to derive these social networks.

2.1 Deriving social networks from event logs

Over the past few years, Process-Aware Information Systems (such as ERP, WFM, CRM and SCM systems) have proliferated, which has led the way to a more
efficient type of method to study the execution of processes - process mining. These systems provide a kind of event log, also known as a workflow log or audit trail. In an event log, all events executed during a business process execution are recorded, and its analysis can yield important knowledge to improve the execution of processes and the quality of the organization's services.

Every process mining technique needs an event log as input. Basically, an event log is the basis and the source that supplies all the information necessary to derive sociograms and proceed with this kind of analysis. An event log is a set of events. Each event in the log is linked to a particular trace and is globally unique (i.e., it cannot appear twice in the same event log). Each event refers to an activity which is related to a particular trace, is recognized by a unique identifier, and can have several properties associated, such as: the timestamp; the activity name; the resource or performer, which is the person that performed the activity; and the event type of the activity, normally classified as start or complete. Thus an event may be denoted by (c, a, p), where c is the case, a is the activity, and p is the person. A trace, also known as a case, represents a particular process instance and is a sequence of events such that each event appears only once.

To clarify the notions mentioned above, let us consider an example adapted from [26]. Consider the emergency treatment process in a hospital. Each case in this process refers to the treatment of a patient in the emergency department. Examples of activities are triage, blood tests, consultation of a specialist, taking a scan, etc. The activities are performed by all kinds of health-care professionals, such as doctors, nurses, radiologists and surgeons. An example of an event is a radiologist taking a thorax scan of a patient at a given point in time. The event log for the emergency treatment process will contain all events for this process. A more abstract example of an event log is shown in Table 2.1. In this example the event log is composed of two process instances and each trace consists of a number of events. For example, the first trace is composed of four events (1a, 1b, 1c and 1d) with different properties.

Trace  Event  Activity  Resource  Timestamp  Type
1      1a     A         Mary      ...:00     start
       1b     A         Mary      ...:13     complete
       1c     B         John      ...:16     start
       1d     B         John      ...:40     complete
2      2a     A         Angela    ...:30     start

Table 2.1: Table representing the content of a fragment from an event log

Now that we have carefully explained which information is stored in event logs, we are able to explain the metrics that have been developed to use this information to derive meaningful sociograms. Within the scope of the organizational perspective, some complex metrics have been studied. We identify four types of metrics that can be used to establish relationships between individuals: (1) metrics based on (possible) causality, (2) metrics based on joint cases, (3) metrics based on joint activities, and (4) metrics based on special
event types [25]. These metrics are possible because events are ordered in time, allowing the inference of causal relationships between activities and the corresponding performers.

Metrics based on (possible) causality monitor, for individual cases, how work moves among performers. Examples of such metrics are handover of work and subcontracting. To briefly illustrate which information from event logs is used in this kind of metric, consider the Handover of Work metric and the event log depicted in Table 2.1. Handover of Work determines who gives work to whom, and this information can be extracted from two subsequent activities in the same case. For example, in case 1 Mary starts and completes activity A, and right after, in the same case, John starts and completes activity B. Thus we can assume that Mary has delegated or passed work to John.

Metrics based on joint cases count how frequently two individuals perform activities for the same case. The Working Together metric is an example of these. In the event log depicted in Table 2.1, consider the events performed by Mary and by John in trace 1. Although Mary and John perform different activities, they perform activities in the same case, thus we can assume that they work together. This metric is explained further in Section 2.2.

Metrics based on joint activities do not consider how individuals work together on shared cases, but focus on the activities they do. One example of the application of this kind of metric is the Similar Tasks metric, which is also explained in Section 2.2. Each performer has a profile which stores how frequently the performer executes each task; the metric determines the similarity of two performers based on the similarity of their profiles. For example, in the event log depicted in Table 2.1 we can observe that Mary only performs activities of type A, John only performs activities of type B, and Angela only performs activities of type A. According to this, since Mary and Angela perform the same type of activities, they are more similar than Mary and John, who have completely different profiles.

Metrics based on special event types consider the type of event. Using these metrics we obtain observations that are particularly interesting for social network analysis, because they represent explicit hierarchical relationships. One example of the application of this kind of metric is the Reassignment metric, which is also explained in Section 2.2. A minimal sketch of how the first three families of metrics can be computed from an event log is given below.
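To make these families of metrics concrete, the following is a minimal sketch (ours, not the ProM implementation) that derives handover-of-work counts, working-together counts and similar-tasks profiles from the event fragment of Table 2.1, using the (c, a, p) notation. For brevity only one completed event per activity is kept.

from collections import Counter, defaultdict
from itertools import combinations

# Each event is a (case, activity, performer) tuple, following the
# (c, a, p) notation; events are assumed to be in temporal order,
# mirroring the fragment in Table 2.1.
events = [
    ("1", "A", "Mary"),
    ("1", "B", "John"),
    ("2", "A", "Angela"),
]

by_case = defaultdict(list)
for case, activity, performer in events:
    by_case[case].append(performer)

# Metric based on causality: handover of work counts, per case, how
# often performer x is directly followed by a different performer y.
handover = Counter()
for performers in by_case.values():
    for x, y in zip(performers, performers[1:]):
        if x != y:
            handover[(x, y)] += 1

# Metric based on joint cases: working together counts how often two
# performers appear in the same case, regardless of activity.
working_together = Counter()
for performers in by_case.values():
    for x, y in combinations(sorted(set(performers)), 2):
        working_together[(x, y)] += 1

# Metric based on joint activities: each performer gets a profile of
# task frequencies; similar profiles mean similar performers.
profiles = defaultdict(Counter)
for case, activity, performer in events:
    profiles[performer][activity] += 1

print(handover)          # Counter({('Mary', 'John'): 1})
print(working_together)  # Counter({('John', 'Mary'): 1})
print(dict(profiles))    # Mary and Angela share the same profile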
2.2 Techniques for Social Network Mining

This section discusses a set of existing mining techniques for social network analysis developed to date. The techniques that we will introduce apply all the metrics discussed above.

Figure 2.2: Doing Similar Tasks as displayed in ProM

2.2.1 Social Network Miner

The main idea of this technique is to monitor how individual process instances are routed between actors. The technique provides five kinds of metrics to generate social networks [26]:

Handover of work metric: This metric determines who passes work to whom. This information can be extracted from an event log by finding subsequent activities in the same case (i.e., process instance), where the first activity is completed by one individual and the second one is completed by another individual.

Subcontracting metric: This metric is similar to the Handover of work metric, but while in the previous one the relationship between two individuals is unidirectional, in this one it is bidirectional. Considering a single case of an event log and two individuals, we know that individual i subcontracts individual j when, in-between two activities executed by individual i, there is an activity executed by individual j.

Working together metric: Two individuals work together if they perform activities in the same case of an event log. This technique only counts how frequently individuals work in the same case.

Similar task metric: While all the techniques above are based on joint cases, this one is based on joint activities. The main idea is to determine who performs the same type of activities. To do so, each individual has his own profile based on how frequently he conducts specific activities. The profiles are then compared to determine similarity. An example of this technique is shown in Figure 2.2.

Reassignment metric: The basic idea of this metric is to detect the reassigning of activities from one individual to another: if i frequently delegates work to j but
not vice versa, it is likely that i is in a higher hierarchical position than j.

2.2.2 Organizational Miner

This technique works at a higher level of abstraction than the previous techniques. While the Social Network Miner works at the level of the individual, the Organizational Miner works at the level of teams, groups or departments. The Organizational Miner provides five kinds of metrics to generate organizational networks:

Default Miner: A simple algorithm that clearly shows the relationship between tasks and originators (the performers of activities). Although this metric belongs to the Organizational Miner, it only derives a flat model, excluding any kind of clustering.

Doing Similar Tasks: This technique joins all the originators that perform similar tasks into the same group.

Hierarchical Mining Clustering: Contrary to the two previous techniques, this one derives a hierarchical model. It implements the Agglomerative Hierarchical Clustering technique based on joint activities, meaning that the clusters are determined according to the activities that each originator performs. Figure 2.3 shows the dendrogram derived from this technique. Through the dendrogram, this technique allows us to derive flat or disjoint organizational entities by cutting the dendrogram at a certain value. As Figure 2.3 shows, by cutting the dendrogram using a cut-off value of 0.2698 we obtain three clusters. Figure 2.4 shows the organizational entities derived from this dendrogram.

Working Together: As opposed to all the metrics mentioned above, this is a metric based on joint cases and not on joint activities. This technique helps identify teams: it puts in the same group all the originators that participate in the same cases. Figure 4.10 and Figure 4.11 are examples of the result of this technique.

Self Organizing Map (SOM): This algorithm is an unsupervised method that performs, at the same time, a clustering and a non-linear projection of a dataset. SOM is a neural network technique that arranges the data according to a low-dimensional structure. The original data is partitioned into as many homogeneous clusters as units, in such a way that close clusters contain close data points in the original space. In other words, similar cases are mapped close to one another in the SOM [2, 24].

2.2.3 Role Hierarchy Miner

This technique is similar to the Doing Similar Tasks technique; however, it takes the analysis to a higher dimension - the organizational dimension. This technique is also based on
Figure 2.3: Hierarchical clustering result represented as a dendrogram (snapshot from ProM v5.2)

Figure 2.4: Organization model derived from the dendrogram. Ovals and pentagons represent actors/originators and organizational entities respectively. (Snapshot from ProM v5.2)
joint activities, and the main idea rests on the profile concept, which determines the subset of tasks performed by each actor in the network. This technique can generate a role hierarchy based on the different activities performed by the actors. A directed arrow between two actors/groups indicates that the actor/group at the base of the arrow can do at least the activities performed by the actor/group at the arrow head [15].

2.2.4 Semantic Organizational Miner

The aim of this technique is to discover groups of users that work together based on task similarity. Tasks are considered to be similar whenever they are instances of the same concepts.

2.2.5 Staff Assignment Miner

Staff assignment rules define who is allowed to do which tasks. This technique mines the real staff assignment rules and afterwards compares them with the staff assignment rules defined for the underlying process. Based on this comparison, possible deviations between existing and mined staff assignment rules can be automatically detected [22].

2.3 The ProM Framework

The work developed in this dissertation was implemented as a plug-in for the ProM Framework 1 [21, 27]. ProM is a powerful tool aimed at process mining in all the perspectives (process, organizational and case). This framework is issued under an open-source license and is extensible, i.e., it has been developed as a completely pluggable environment. Currently, more than 280 plug-ins have been included. The most relevant plug-ins for this work are the mining plug-ins. Figure 2.5 presents an overview of the architecture of ProM, showing the relations between the framework, the plug-ins and the event log.

The event log that is usually used as input to the plug-ins is in the Mining XML (MXML) format, a specific format based on XML and specially designed for this framework [28]. Each Process-Aware Information System (PAIS) has its own log file format, which makes the use of process mining tools difficult, because every time we want to use an event log as input, we first need to convert it to a format supported by the process mining tool. This requires knowledge not only of the PAIS event log format but also of the process mining tool's event format. To make things easier, the developers of ProM decided to create MXML. This format follows a specified schema definition, which means that the log does not consist of random and disorganized information; rather, it contains all the elements needed by the plug-ins at a known location [17, 25, 26].

1 For more information and to download ProM visit
Figure 2.5: Overview of the ProM Framework (adapted from [27])

Figure 2.6 represents the MXML format and Figure 2.7 shows a snapshot of an MXML log. The process log starts with the WorkflowLog element, which contains Source and Process elements. The Source element refers to information about the software or system that was used to record the log, while the Process element represents the process to which the log belongs. The Process element, in turn, is made up of several audit trail entries. An audit trail entry corresponds to an atomic event and records information such as: the WorkflowModelElement (the activity the event corresponds to), the EventType (the type of the event), the Timestamp (the time the event occurred), and the Originator (the individual that performed the activity) [17, 25, 26].

As shown in Fig. 2.5, Process-Aware Information Systems (PAIS) generate these event logs, and the Log Filter is used when reading the logs if it is necessary to filter them before performing any other task. As Fig. 2.5 also shows, the ProM framework allows five different types of plug-ins [7, 27]:

Import plug-ins: a wide variety of models can be loaded, ranging from Petri nets to LTL formulas.

Mining plug-ins: implement some mining algorithm, e.g., mining algorithms that construct a Petri net based on some event log. The results are stored as a Frame.

Analysis plug-ins: typically implement some property analysis on a mining result. For example, for Petri nets there is a technique which constructs place invariants, transition invariants, and a coverability graph.

Conversion plug-ins: take a mining result and transform it into another format, e.g., from EPCs to Petri nets.

Export plug-ins: implement "save as" functionality for some objects (such as graphs). For example, there are plug-ins to save EPCs, Petri nets, spreadsheets, etc.
Figure 2.6: MXML format (adapted from [8])

All the mining techniques for social network analysis described in Section 2.2 are available in ProM.

2.4 Conclusion

In this chapter we have introduced the key concept of process mining - the event log. We have explained which type of information about business processes is stored in event logs, and how this information is used to derive meaningful sociograms in the organizational perspective. A set of metrics has been developed with the purpose of establishing relationships among individuals from event log information. We have also discussed a set of techniques developed for social network mining. Finally, we have introduced the framework used throughout this dissertation, and the log file format used by its process mining techniques.
<?xml version="1.0" encoding="UTF-8"?>
<WorkflowLog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="WorkflowLog.xsd"
             description="Test log for decision miner">
  <Source program="name:, desc:, data: {program=none}">
    <Data>
      <Attribute name="program">name:, desc:, data: {program=none}</Attribute>
    </Data>
  </Source>
  <Process id="0" description="">
    <ProcessInstance id="Case 4" description="">
      <AuditTrailEntry>
        <WorkflowModelElement>Register Claim</WorkflowModelElement>
        <EventType>start</EventType>
        <Timestamp>...T09:52:...</Timestamp>
        <Originator>Robert</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <Data>
          <Attribute name="Amount">500</Attribute>
          <Attribute name="CustomerID">C...</Attribute>
          <Attribute name="PolicyType">Normal</Attribute>
        </Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement>
        <EventType>complete</EventType>
        <Timestamp>...T10:11:...</Timestamp>
        <Originator>Robert</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>Check policy only</WorkflowModelElement>
        <EventType>start</EventType>
        <Timestamp>...T10:32:...</Timestamp>
        <Originator>Mona</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>Check policy only</WorkflowModelElement>
        <EventType>complete</EventType>
        <Timestamp>...T10:59:...</Timestamp>
        <Originator>Mona</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement>
        <EventType>start</EventType>
        <Timestamp>...T11:22:...</Timestamp>
        <Originator>Linda</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <Data>
          <Attribute name="Status">approved</Attribute>
        </Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement>
        <EventType>complete</EventType>
        <Timestamp>...T11:47:...</Timestamp>
        <Originator>Linda</Originator>
      </AuditTrailEntry>
    </ProcessInstance>
    ...
  </Process>
</WorkflowLog>

Figure 2.7: MXML snapshot
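Given this fixed structure, extracting events from an MXML log requires little machinery. The following is a minimal sketch (ours, not part of ProM) using Python's standard XML library; the file name refers to the example log of Appendix A.

import xml.etree.ElementTree as ET

def read_mxml(path):
    """Extract (case, activity, event type, originator) tuples
    from an MXML log, following the structure of Figure 2.6."""
    events = []
    root = ET.parse(path).getroot()          # the <WorkflowLog> element
    for process in root.iter("Process"):
        for instance in process.iter("ProcessInstance"):
            case_id = instance.get("id")
            for entry in instance.iter("AuditTrailEntry"):
                activity = entry.findtext("WorkflowModelElement")
                event_type = entry.findtext("EventType")
                originator = entry.findtext("Originator")
                events.append((case_id, activity, event_type, originator))
    return events

# Example usage:
# for event in read_mxml("insuranceclaimhandlingexample.mxml"):
#     print(event)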
Chapter 3

Social Network Analysis

In the previous chapter we explained how to obtain the data for creating sociograms. Once we have a sociogram, we are able to start social network analysis (SNA). We start by explaining broadly what SNA consists of and the value and benefits it brings to the business. With SNA tools, several techniques can be applied to analyze social networks and draw conclusions both at the individual level (i.e., analyzing each node individually and deriving relationships between individuals) and at the level of the entire network. Finally, we discuss a very common characteristic of social networks - the existence of community structures [8]. The identification and study of these structures can be helpful in SNA, especially for networks of large dimensions.

3.1 Social Network Analysis (SNA)

In a very competitive market it is crucial for organizations to have access to knowledge and information, preferably before other organizations, because unique and valuable information can guarantee a good competitive advantage. The acquisition of information therefore allows organizations to improve the performance of their strategic business processes. Communication among people is not only important because it allows the spread of information; it is also the key factor in the creation of innovation and, consequently, the creation of value for the organization. All organizations establish a formal social structure where all the hierarchical relationships between employees are defined. However, in most cases, the relationships that really exist in an organization have little to do with the structure previously defined [6]. Social network analysis (SNA), which is the analysis of social networks from the organizational perspective, plays a very important role, since it evaluates the relationships among people, teams, departments or even entire organizations [6]. This kind of analysis can yield important information to improve the flow of communication inside an organization and allows managers to discover the way work is being done informally. The main goal of SNA is to make the communication process completely transparent and to provide tools to make the whole communication process better and more fluent.
All SNA techniques rely on graphical representation; thus a social network is represented as a graph, where each node is a person and each link between two nodes is a relationship [3, 16].

3.2 SNA Measures

After generating a social network as a graph (sociogram), it is necessary to define measures to perform SNA, so that it is possible to make comparisons among actors or networks. Measures in SNA can be separated into those that evaluate the entire network and those that evaluate a specific node [11, 26]. Below we list and explain some of the existing measures.

3.2.1 Measures for an individual level

When analyzing a specific individual (i.e., a node in the graph) we need to determine his role and influence in the network, i.e., to know whether the individual is a leader or is isolated from the rest of the network, or whether he is a crucial link enabling the connection between two other individuals. There are many such notions about an individual. Below we explain some of the metrics that are usually used to capture these notions.

Degree: The degree of a node (sometimes called Degree Centrality) is the number of nodes that are connected to it. This measure can be seen as the popularity of each actor. If a directed graph is being used, the single degree metric is split into two metrics: (1) In-Degree, which measures the number of nodes that point toward the node of interest, and (2) Out-Degree, which measures the number of nodes that the node of interest points toward.

Betweenness Centrality: This measure computes the influence that a node has over the spread of information through the network. In a social network context, a node (i.e., person) with a high betweenness centrality value performs a crucial role in the network, because this person enables the connection between two different groups. If this node is the only bridge linking these two groups and for some reason it is no longer available, the exchange of information and knowledge between these two groups becomes impossible.

Closeness Centrality: This measure computes how close each node is to the other nodes in the network. Unlike other centrality metrics, a lower closeness centrality value indicates a more central (i.e., important) position in the network. In a social network context, this means that a node (i.e., person) with a higher closeness centrality value needs to go through many intermediate nodes to reach the node it wants, while a node with a lower closeness centrality value is able to contact the same node in fewer steps. The latter is therefore in the best position to monitor the information flow in the network, as it has the best visibility into what is happening in the network.
Eigenvector Centrality: This measure is similar to the degree, since it counts how many connections a node has, but it goes further and takes into consideration the degree of the vertices that are connected to it. In a social network context, two nodes can have the same degree value, but if one of them is connected to nodes that have important roles in the network, that node will have a higher eigenvector centrality value than the other.

Clustering Coefficient: This measure determines a node's capacity to cluster together with its neighbours. To compute it, it is necessary to determine how close the node's neighbours are to being a clique. By clique we mean a network where all possible connections exist; e.g., a network with 4 nodes and undirected links would be a clique if it had 6 links, with all nodes directly connected to each other. More specifically, the clustering coefficient is the number of links connecting a node's neighbours divided by the total number of possible links between those neighbours. In a social context, a node with a high clustering coefficient is deeply embedded in the network, while a node with a low coefficient is a peripheral node, more disconnected from the other nodes. Peripheral nodes suffer from a lack of new knowledge and information.

3.2.2 Measures for the network level

The metrics above are restricted to a single individual. But when doing network analysis it is also necessary to draw conclusions about the whole network, e.g., to determine the capacity of the network to be separated into smaller sub-networks (clusters), or to determine whether the network is sparse or dense. Below we explain some of the metrics that are usually used to capture these notions; a brief code illustration of both the individual-level and network-level measures follows at the end of this section.

Density: The value of this measure ranges between 0 and 1, indicating how interconnected the vertices are in the network. In the social context, a dense network means that everyone communicates with everyone. The density is defined as:

    Density = n / N    (3.1)

where n represents the number of links that exist in the network and N represents the maximum number of possible links. For example, an undirected network of 5 nodes has at most N = 10 links; if n = 4 of them exist, the density is 0.4.

Clustering coefficient: This metric determines the probability of a network being partitioned into a finite number of sub-networks. In the social context, a new cluster is seen as a new team/group in the organization.

Centralization: This measure is directly connected to the individual notion of centrality explained in the previous section. The lower the number of central nodes in the network, the higher the centralization of the network. In the social context, a highly centralized network is dominated by one or a few persons. If for some reason such a person is removed, the network quickly breaks into unconnected sub-networks. A highly centralized network is not a good sign, because it means that the network has critical points of failure, putting too much trust and power in a single individual.
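As an illustration, the sketch below computes the individual-level and network-level measures above on an invented toy sociogram, using the third-party networkx library (our choice of tool, not one used by ProM).

import networkx as nx

# Toy sociogram: nodes are people, edges are relationships.
# Two triangles connected through Angela and Robert.
g = nx.Graph()
g.add_edges_from([
    ("Mary", "John"), ("Mary", "Angela"), ("John", "Angela"),
    ("Angela", "Robert"),              # Robert bridges to a second group
    ("Robert", "Mona"), ("Robert", "Linda"), ("Mona", "Linda"),
])

# Individual-level measures from Section 3.2.1.
print(dict(g.degree()))                # degree of each node
print(nx.betweenness_centrality(g))    # Angela and Robert act as bridges
# Note: networkx reports the reciprocal form of closeness, where
# HIGHER values mean more central, unlike the distance-based
# definition in the text where lower values are more central.
print(nx.closeness_centrality(g))
print(nx.eigenvector_centrality(g))
print(nx.clustering(g))                # clustering coefficient per node

# Network-level measure from Section 3.2.2:
# density = existing links / maximum possible links, as in Eq. (3.1).
print(nx.density(g))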
3.3 Finding community structures in networks

SNA relies heavily on graphical visualization: the output of every SNA algorithm is a graph representing the network. The measures presented above are crucial to analyze the network; however, the analysis process must be complemented with a graphical analysis. When dealing with networks of large dimensions, SNA becomes difficult and complex. Facing this problem, many algorithms have been developed to identify communities (sub-networks) in the network, adopting a divide-and-conquer technique. The algorithms developed are a merge of clustering algorithms and the SNA measures discussed in Section 3.2. The problem of finding good divisions of networks has a long history. By good divisions, we mean finding the most natural sub-groups of a network, the ones most similar to the groups of the real social structure. We will present, following their evolution over time, some of the most important clustering methods used to detect community structures in networks. But first we need to make clear what a community structure is.

Community structure definition

Social networks have been studied for quite a while, in fields ranging from modern sociology, anthropology, social psychology, communication studies, information science and organizational studies to biology. The general notion of community structure in complex networks was first pointed out in the physics literature by Girvan and Newman [9], and refers to the fact that nodes in many real networks appear to group into distinct subgraphs/communities. Inside each community there are many edges among nodes, but among communities there are fewer edges, producing a structure like the one shown in Figure 3.8: the density of edges within communities is high, while among communities it is sparse. To better understand how Girvan and Newman arrived at this definition, we will introduce an important theory developed in the sociology area - the Strong and Weak Ties theory [10, 16].

Strong and Weak Ties is a theory authored by Granovetter [10, 16], in which the author argues that within a social network, weak ties are more powerful than strong ties. In a social network, the strength of a tie that links two individuals may range from weak to strong, depending on the quantity, quality and frequency of exchanges between the actors. Stronger ties are characterized by increased frequency of communication and more intimate connections between individuals; for example, strong ties exist among close friends, family members, or workers of one specific department. Weak ties, by contrast, imply more limited investments of time and intimacy, resulting in an array of social acquaintances; for example, weak ties are common between different departments. Granovetter [16] holds that strong ties are considered more useful in facilitating the flow of information between individuals. Weak ties, on the other hand, are of greater
Figure 3.8: Network with community structure. In this case there are three communities (represented by the dashed circles) composed of densely connected vertices. Links of lower density (depicted with thinner lines) are the ones that establish a connection between the different communities.

importance in encouraging the exchange of a wider variety of information between groups in an organization. People with few weak ties within a community will be cut off from receiving new information from outside circles and will be resigned to hearing the same re-circulated information. For this reason weak ties are more powerful than strong ties: they are able to spread new information and innovation, and consequently bring value to the company [12, 16]. This theory idealizes the social network as a group of communities, where a community is a set of individuals with dense and strong ties between them, i.e., individuals with a high level of intimacy, while connections between communities should be sparse and weak.

Now that the meaning of community structure is clear, we are able to introduce the algorithms developed to identify these structures. The algorithms are presented in two main groups: Traditional Approaches, where we introduce the beginnings of clustering algorithms, and Recent Approaches, where we introduce the most recent discoveries on this subject. The algorithms described below assume that the network is as simple as possible, i.e., that links between nodes are undirected and unweighted.

3.3.1 Traditional Approaches

It is commonly stated in the literature [8, 9, 19] that the most traditional approaches to this problem have their origin in two main fields: Computer Science, which created the idea of graph partitioning; and Sociology, which created the idea of hierarchical clustering.
Computer Science Approaches

Graph partitioning [9] is a top-down approach based on iterative bisection. This kind of algorithm finds the best division of the network into two groups. If it is necessary to divide the social network into more than two groups, each of the groups generated in the previous iteration is divided into two new groups, and the subdivision is repeated until we have the required number of groups. The main disadvantage of this approach is that it only divides the network into two groups at a time, not into an arbitrary number of groups. For example, if we want to divide the network into three clusters, the algorithm will first divide the network into two clusters and then divide one of those two clusters into two new clusters, yielding three clusters at the end. This approach does not guarantee that this is the best division, and the results produced are far from satisfactory.

Sociological Approaches

Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with a cluster for each element of the network and end with a single cluster containing all the members of the network; each iteration merges two of the existing clusters, so that there is one cluster less. Divisive algorithms work the opposite way: they start with a single cluster containing all the elements of the network and end with one cluster for each element; each iteration divides an existing cluster in two. Most organizational models are hierarchical, so for the purpose of finding communities in networks the agglomerative algorithms are used more often than the divisive ones. Given a network with a set of N nodes, the basic process of agglomerative hierarchical clustering is the following:

1. Each node is assigned to a cluster (if there are N nodes, there will be N clusters, each containing just one item). In this step the distances (similarities) between the clusters are the same as the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that there is now one cluster less. Compute the distances (similarities) between the new cluster and each of the old clusters.

3. Repeat step 2 until all items are clustered into a single cluster of size N.

The way the distance between two clusters is computed in step 2 can be defined in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering; a minimal code sketch of the whole procedure is given after the linkage definitions below.

Single Linkage

Single linkage, also known as the nearest neighbour technique, defines the similarity between two clusters as the distance between the closest pair of elements of those
clusters. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. In the single linkage method, D(r, s) is computed as D(r, s) = min d(i, j), where element i is in cluster r and element j is in cluster s. This technique computes all possible distances between objects from cluster r and objects from cluster s, and the minimum of these distances is taken as the distance between clusters r and s. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.

Complete Linkage

This method is similar to the previous one, but instead of considering the minimum value, it considers the maximum value: complete linkage computes the distance between two clusters as the distance between the two most distant elements in the two clusters.

Average Linkage

Here the distance between two clusters is defined as the average of all distances between the two clusters. In the average linkage method, D(r, s) is computed as D(r, s) = T_rs / (N_r · N_s), where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s respectively. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.

The main disadvantage of this approach is that it usually fails to find the right communities in networks whose real structure is known, which makes it difficult to rely on this algorithm in other cases. Another disadvantage is that it tends to neglect the peripheral nodes and to find only the cores of the communities: the nodes in the core of the network tend to have stronger similarity between them, so the agglomerative algorithm will tend to cluster these nodes early.
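The following is a minimal sketch of the agglomerative procedure described above, with the three linkage rules as options. It is our own simplified illustration over a toy distance matrix (all values invented), not the ProM implementation, and it favours clarity over efficiency.

def ahc(dist, linkage="single"):
    """Naive agglomerative hierarchical clustering.
    dist[i][j] is the distance between elements i and j;
    returns the list of merges as (cluster_a, cluster_b)."""
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def d(r, s):
        pairs = [dist[i][j] for i in r for j in s]
        if linkage == "single":    # closest pair of elements
            return min(pairs)
        if linkage == "complete":  # most distant pair of elements
            return max(pairs)
        # average: sum of all pairwise distances / (N_r * N_s)
        return sum(pairs) / (len(r) * len(s))

    while len(clusters) > 1:
        # Find the pair of clusters with minimum D(r, s) and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda p: d(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy distance matrix for four originators.
D = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]]
# Merges {2, 3} first, then {0, 1}, before joining the two groups.
print(ahc(D, linkage="average"))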
Recent Approaches

Trying to address the problems of the two approaches above, and due to the emergence of more complex networks such as the Internet, the World-Wide-Web and e-mail, some efforts have been made and new approaches have emerged. Most recent approaches rely on hierarchical clustering; however, each approach tries to improve the algorithm by applying some of the SNA measures discussed in Section 3.2. For example, the algorithm of Girvan and Newman [9, 19] is one of the recent algorithms that finds the communities closest to the real community structure, giving the most satisfactory results. This is a divisive method based on the removal of edges: the idea is that removing the edges with the highest betweenness will split the network into its natural communities. The edges with high betweenness can be imagined as bridges among communities, and therefore as the boundaries between communities. Although the results are very satisfactory, the performance of the algorithm when dealing with large networks is very poor. The algorithm is computationally heavy because, every time an edge is removed from the network, the betweenness of every remaining edge needs to be re-evaluated. As an effort to overcome this performance issue, the algorithm of Tyler [20] (used in studies of e-mail networks) was developed. Although this algorithm is faster, the accuracy of its results is lower. Over the years, algorithms such as the algorithm of Radicchi [19] and the algorithm of Wu and Huberman [19] have appeared to address the limitations of the previous algorithms. Although they are faster, their results are poorer than those of the Girvan and Newman algorithm [19].

All the algorithms mentioned above, both traditional and recent approaches, have drawbacks. Although each one is an attempt to address and overcome the issues of the previous one, there is one disadvantage common to all approaches that deserves particular attention: none of the algorithms gives a guide to how many communities a network should be split into. To address this problem, along with performance and accuracy, a new concept has recently emerged: Modularity [19, 20]. The authors of the Girvan and Newman algorithm [20] developed this concept when they were faced with the handicap that their algorithm did not provide any hint about how many communities the network should be split into. Modularity is a quality measure for graph clustering. It measures whether a specific division of a network into a group of communities is good or not, in the sense that the connections inside a community are dense and the connections between communities are sparse. Modularity will be explained in more detail in Section 5.3.

3.4 Conclusion

In this chapter we have introduced Social Network Analysis, one of the three perspectives of process mining and the one on which this dissertation is focused. The kind of information that can be depicted with this analysis was also explained in this chapter: Social Network Analysis is able to extract information not only about each individual in the network but also about the entire network. We have also presented some algorithms that can derive meaningful sociograms.
Chapter 4

Proposed Approach

In this chapter we analyze the existing social mining tools available in the market and compare them with ProM. This way we attempt to address the major challenges present in social mining tools, in particular in ProM, and we present a proposal to overcome the challenges found. In this chapter we also make clear why we decided to develop a plug-in for the ProM framework rather than for another framework.

4.1 Motivation

Nowadays there is a plethora of software for Process Mining with Social Network Analysis tools; some examples are NetDraw, Pajek, NetMiner, UCINET and MultiNet, besides ProM, among many others. To better understand the advantages and limitations of ProM, a comparison was made with three of the most well-known open-source packages. Figure 4.9 illustrates this comparison, showing the different phases supported by the software packages during a social network analysis.

The goal of process mining is to extract information about processes from event logs. Nowadays, most Process-Aware Information Systems (PAIS) generate event logs. However, each of them has its own data structure and its own language to describe its internal structure. When trying to use events from different systems to do process mining, we need to be able to present logs in a standardized way, so that the software for SNA that we are using can process and analyze the event logs [7]. One of the main advantages of ProM over other software is that it can map the meta-models of widely-used information systems to its own meta-model, MXML. Unfortunately, the other software packages do not establish this direct connection with PAIS. If the user wants to do a social network analysis of a log, the user has to do the mapping
Figure 4.9: Comparison of the different phases supported by ProM and other software packages during a social network analysis.

between the meta-model of the PAIS and the input format supported by the software. This process requires knowledge of both meta-models [7, 13]. Another advantage of ProM over other tools is that, while the others only represent the data in a graphical way (sociogram) and determine some of the Social Network Analysis measures described in Section 3.2, ProM offers many mining tools that can not only determine SNA measures but also derive meaningful sociograms from the event logs. These mining techniques were described in Section 2.2. ProM is the most complete Process Mining tool, since it is able to mine all three perspectives (process, organizational and case). However, in ProM it becomes very challenging to analyze large networks. One of the requirements of process mining is a graphical representation of the data: data represented as a graph, plot or social network becomes much easier to analyze. In this respect ProM falls short. Figure 4.11 shows a large network represented by ProM; as we can see, it is a confusing representation of the data, and the image is static, i.e., the user cannot manipulate it (move nodes away, re-arrange the positions of nodes, etc.).

4.2 Proposal

In the previous section we have shown that ProM is a more complete framework than the others enumerated; however, it has poor and limited visualization capabilities. As shown in Section 2.2, ProM has many plug-ins available for the organizational mining perspective, but most of these plug-ins perform the analysis at the individual level, without enabling the user to detect communities or groups inside the network. For example, with the Working Together plug-in, ProM tells us that originator A works with originator B, but it cannot give us any information about teams, for example: how many teams exist in the company and how the different teams interact and are connected with one another. Our proposal attempts to overcome both issues presented above. The main goals of our proposal are:

1. to develop ProM plug-ins for the organizational perspective, making it possible to identify groups/communities of originators in the social network;

2. to provide ProM with advanced visualization capabilities so that it becomes easier to
analyze and obtain more interesting and richer outcomes.

The next subsections explain how we intend to achieve these goals.

4.2.1 Application of Agglomerative Hierarchical Clustering in SNA

In Chapter 3 we discussed some of the most important traditional and recent clustering methods used to detect community structures in networks. From those we decided to implement Agglomerative Hierarchical Clustering (AHC), for reasons that we will now explain. All networks generated by the mining tools of ProM are weighted graphs, and the analysis process takes the weights of the links into consideration. This information brings value to the analysis, since it makes the analysis richer and more interesting. The weights of the links are extracted from information in the event logs; this information can be, for example, how many times two originators work together or how many tasks two originators perform in common. The weight of a link represents the power of the relationship between two nodes, i.e., how frequently they work together. The recent algorithms discussed in Section 3.3.2 were designed for simple networks, undirected and unweighted. If we adopted one of the recent algorithms we could not have weighted links and would be wasting information that is important and crucial for network analysis. Recent algorithms do not exploit the information in event logs: they do not apply any mining tool, nor do they use metrics like the ones discussed in Section 2.1. They only map the information into a graph and then determine the communities based on SNA measures (most of them use betweenness and degree) or on distance measures such as the Euclidean and Hamming distances. Due to these limitations, we excluded the recent approaches. We also excluded the traditional approaches based on graph partitioning because of the disadvantages of that technique mentioned above. Therefore we decided to choose hierarchical clustering, which is the fundamental basis of most algorithms for finding communities. This approach allows us to analyze a gradual agglomeration of the nodes into communities, starting from the individual perspective and moving to the organizational perspective. Although we implemented Agglomerative Hierarchical Clustering, we did it with some improvements and adjustments:

1. The first adaptation of the algorithm consists in using the power of the relationship between nodes to determine if they belong to the same cluster. If actor A and actor B work together in five cases, and actor A and actor C work together in only two cases, then the relationship between actor A and actor B is stronger than the relationship between actor A and actor C [16].

2. We add the concept of modularity to the algorithm. Our algorithm determines the modularity of each division, so that we can know which one has the highest quality (the one with the highest value of modularity). The modularity of each division is shown in a chart and the best one is highlighted.
Given a network with a set of N nodes, our plug-in works as follows:

1. Each node is assigned to a cluster (if there are N nodes, there will be N clusters, each containing just one item). In this step the distances (similarities) between the clusters correspond to the power of the relationships of the nodes they contain;

2. Then we search for the most powerful relationship between two clusters and merge those clusters into a single one, so that there is one cluster less;

(a) If there are several candidates, i.e., more than one pair of clusters with the most powerful relationship, we decide which candidates to agglomerate based on two options: (1) we choose the last pair of clusters found, or (2) we choose the pair of clusters that maximizes the modularity.

3. Compute the distances (similarities) between the new cluster and each of the old clusters. For this step we may use one of these methods: single-linkage, complete-linkage or average-linkage;

4. We determine the value of modularity for this number of clusters;

5. Repeat steps 2, 3 and 4 until all items are clustered into a single cluster of size N.

Adopting this approach helps us achieve both of our main goals. The first goal is achieved because Agglomerative Hierarchical Clustering (AHC) is widely used to identify teams/groups/communities in a network. The second goal is also achieved because, in a way, AHC reduces the size of the network: it allows us to analyze the network at the individual level (first iteration), i.e., the relationships between originators, and also at the organizational level, i.e., it identifies communities and shows the relationships between those communities.
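The following is a condensed, self-contained Java sketch of the five steps above (ProM plug-ins are written in Java). The class and method names are illustrative and do not correspond to the plug-in's actual source; the tie break shown is option (1), and step 4 is only indicated by a comment, since modularity is detailed in Section 5.3.

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal sketch of the clustering loop described above. */
    public class AhcSketch {

        public static void run(double[][] w) {                 // w = symmetric weight matrix
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < w.length; i++) {               // step 1: one cluster per node
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }
            while (clusters.size() > 1) {
                int bestR = 0, bestS = 1;
                double bestPower = -1;
                for (int r = 0; r < clusters.size(); r++)      // step 2: strongest pair
                    for (int s = r + 1; s < clusters.size(); s++) {
                        double p = power(w, clusters.get(r), clusters.get(s));
                        if (p >= bestPower) {                  // ">=" keeps the last tied pair (option 1)
                            bestPower = p; bestR = r; bestS = s;
                        }
                    }
                clusters.get(bestR).addAll(clusters.remove(bestS));
                // step 3 is implicit: power() recomputes cluster similarities on demand;
                // step 4: compute the modularity of this division (see Section 5.3).
                System.out.println(clusters.size() + " clusters after merging at power " + bestPower);
            }                                                  // step 5: loop until one cluster remains
        }

        // Single-linkage variant: the power between two clusters is the
        // strongest link between any pair of their members.
        static double power(double[][] w, List<Integer> r, List<Integer> s) {
            double best = 0;
            for (int i : r) for (int j : s) best = Math.max(best, w[i][j]);
            return best;
        }
    }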
Ideally, we would have implemented our proposal in all five tools discussed in Section 2.2. However, we decided to apply our proposal to only two algorithms: Working Together and Similar Tasks. We will now explain why we chose these algorithms.

Working Together Algorithm

Of all the social mining tools in ProM, discussed in Chapter 2, only one uses the Agglomerative Hierarchical Clustering (AHC) approach [19], and it is based on joint activities. This algorithm assumes that actors performing similar tasks have a high probability of working together. However, this is not entirely true. For example, people of an organization working in departments such as the financial, accounting, marketing and manufacturing departments probably perform common tasks and consequently have similar profiles; however, nothing assures us that they work together. Our main goal from the beginning was to identify groups/communities of originators in the social network, and we must be sure that people in the same community definitely work together. To address this issue, we need to focus on the cases and not on the activities. Thus, our purpose is to provide the Working Together mining tool of ProM with a functionality that can analyze a network in a progressive way, i.e., show the gradual agglomeration of nodes into groups (bottom-up approach).

Figure 4.10: Output of ProM 5.2 using the Working Together mining tool applied on a small network. In this case we used DecisionMinerLog.xml supplied by ProM 5.2.

ProM already has a Working Together algorithm; however, this algorithm is unable to identify communities. Figure 4.10 shows a social network obtained with the Working Together algorithm of ProM. As we can see in the figure, the algorithm identifies two distinct groups, but we only get these two distinct groups because all the people in the log file work on disjoint cases: the log file does not have a single case where the same person belongs to two different teams. In Figure 4.11 this does not happen, because everyone works with everyone else in at least one case, resulting in a network with no distinct groups; in this case, in the organizational model, ProM determines that there is only one work group. So we can conclude that the Working Together algorithm is only helpful when there are disjoint teams from the beginning (in the log). Networks of large dimensions are hard to visualize in a single view, therefore meaningful substructures have to be identified so that they can be visualized separately.

Similar Tasks Algorithm

ProM already has a Similar Tasks algorithm; however, this algorithm is also unable to identify communities. It has the same behaviour and limitations when detecting communities as the Working Together algorithm explained just above: it is only capable of detecting teams when there are disjoint teams from the beginning (in the log); otherwise the algorithm will indicate that all originators belong to the same community. Another reason why we chose this algorithm is that it was the only way we could verify that the modularity concept really works. This issue is further explored in Chapter 5.
Figure 4.11: Output of ProM 5.2 using the Working Together mining tool applied on a large network. In this case we used outpatientclinicexample.mxml supplied by ProM 5.2. It is relevant to say that the mining result image is just a tiny part of the real network.

4.2.2 Displaying social networks

As we said in the previous section, Agglomerative Hierarchical Clustering helps us achieve our second main goal. However, it is a help and not the complete solution to the problem. Mining analyses rely heavily on the graphical representation of data, so it is crucial to make a significant investment in a good tool for representing the data, one that allows the user to interact with the graph and to manipulate and re-arrange it at will. Thus, we adopted a recent tool, JGraph (for more information and to download JGraph, visit the JGraph website). With this tool the user can observe the network change dynamically and watch the progressive clustering of the elements of the network. The user can also manipulate and re-arrange the network as desired. We also developed some features that help the analysis, such as changing the thickness of the links according to their weight, i.e., the greater the weight of the link, the greater the thickness with which it is represented. Figure 4.12 is an example of what can be done using JGraph.
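As an illustration of the thickness feature, the fragment below sketches how a weighted edge could be drawn, assuming the JGraphX (mxGraph) flavour of the library; the exact JGraph API used by our plug-in may differ from this sketch, and the names and values below are purely illustrative.

    import com.mxgraph.view.mxGraph;

    /** Sketch of weight-proportional edge thickness with JGraphX. */
    public class SociogramSketch {

        public static void main(String[] args) {
            mxGraph graph = new mxGraph();
            Object parent = graph.getDefaultParent();
            graph.getModel().beginUpdate();
            try {
                Object a = graph.insertVertex(parent, null, "John", 20, 20, 80, 30);
                Object b = graph.insertVertex(parent, null, "Fred", 240, 20, 80, 30);
                double weight = 5;  // e.g. number of cases the two originators share
                // The greater the weight of the link, the greater the stroke width.
                String style = "strokeWidth=" + Math.max(1, Math.round(weight / 2.0));
                graph.insertEdge(parent, null, weight, a, b, style);
            } finally {
                graph.getModel().endUpdate();
            }
        }
    }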
Figure 4.12: Social Network of the insuranceClaimHandlingExample.mxml. This screen-shot corresponds to the 1st iteration of the Working Together AHC algorithm, using tie break with modularity. At this point each cluster corresponds to a single originator.
One of the big challenges of process mining is to be able to represent such an amount of information in a more user-friendly way. The most common forms of representing and analyzing social networks are (1) descriptive methods (e.g. text or graphs); (2) analysis procedures, often based on matrix operations, presented in data files with proper formats or in ontological representations; and (3) statistical models based on probability distributions. One reason for using mathematical and graphical techniques in SNA is to represent the descriptions of networks compactly and systematically [14]. At the beginning of the implementation we thought that a graphical representation of the network would be enough; however, very dense graphs of large dimensions, no matter how good the design tool is, tend to be impractical to manipulate and analyze. We were able to identify this challenge because our real-world case came from a Portuguese hospital emergency department, and logs from this kind of institution are very large and complex. To overcome this issue, we decided that the graphical representation should be complemented with a matrix representation; this way, when it gets too difficult and impractical to analyze the relationships through the graph, the user can resort to the matrix representation. Figure 4.13 shows the matrix of the social network depicted in Figure 4.12. We also developed a feature that allows the user to analyze each cluster individually, in both the graphical and the matrix representations.

Figure 4.13: Matrix showing the relationships among originators of the social network depicted in Figure 4.12.

4.3 Conclusion

In this chapter we have introduced some of the tools for process mining available in the market and have established a comparison between the existing tools and ProM. This analysis was very helpful not only to identify the advantages of ProM but also to identify its limitations and disadvantages. This way we were able to construct our proposal with the intention of making ProM a more powerful and user-friendly tool. We also explained how we intend to achieve our main goals.
Chapter 5

Implementation in ProM

In this chapter we describe how the proposed technique was implemented as a plug-in for ProM. This plug-in is called Organizational Cluster Miner. First we explain which information we extract from the log file to do our analysis, and how we extract it. Based on the information extracted, we then explain how we compute the hierarchical clustering algorithm. Finally, we explain the modularity concept and how to compute it. We first explain the Working Together analysis in more detail; we then explain the Similar Tasks analysis, whose implementation is very similar to the Working Together analysis, the only difference being the information that is extracted from the log file. Throughout this chapter we use a simple log file, which is available with ProM v5.2 (insuranceClaimHandlingExample.mxml) [1].

5.1 Extracting information from the Log file

The process mining framework ProM uses a standard XML format, known as Mining XML (MXML). To understand this chapter and how we extract the information necessary to do the Working Together analysis, it is important to recap some of the principal concepts of this format. A process log consists of several instances or cases, each of which may be made up of several audit trail entries. An audit trail entry corresponds to an atomic event and records information such as: WorkflowModelElement (refers to the activity the event corresponds to), EventType (specifies the type of the event), Timestamp (refers to the time when the event occurred) and Originator (the individual that performed the activity). The Working Together analysis focuses on cases and derives case-based structures. The metric counts how frequently two originators perform activities for the same case. If individuals work together on the same cases, they will have a stronger relationship than individuals that rarely work together. Table 5.2 shows the information contained in the insuranceClaimHandlingExample log. The first column refers to the Case ID, the second one to the WorkflowModelElement, the third one to the EventType, the fourth to the user generating the event and the last one shows a time stamp.
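For readers unfamiliar with MXML, the fragment below reconstructs the shape of such a log from the element names just described; the attribute and timestamp values are purely illustrative, and details of the real schema may differ.

    <WorkflowLog>
      <Process id="insuranceClaimHandling">
        <ProcessInstance id="1">
          <AuditTrailEntry>
            <WorkflowModelElement>Register Claim</WorkflowModelElement>
            <EventType>start</EventType>
            <!-- illustrative value; the real log's timestamps are not shown here -->
            <Timestamp>2008-01-04T10:55:00.000+01:00</Timestamp>
            <Originator>John</Originator>
          </AuditTrailEntry>
          <!-- ... further AuditTrailEntry elements for this case ... -->
        </ProcessInstance>
        <!-- ... further ProcessInstance elements, one per case ... -->
      </Process>
    </WorkflowLog>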
When looking at a log file, two originators work together if they perform activities (WorkflowModelElement) in the same case. For example, in the log shown in Table 5.2, for the first case we can assume that John, Fred, Robert and Howard all work together, because all of them perform activities in this case. So we can conclude that each of them works with every other one at least once. Table 5.3 shows how many times the originators work with one another. Each cell indicates the power of the relationship between two originators; the power of the relationship corresponds to the number of times the two originators work together. The more two individuals work together, the greater the power of the relationship. For example, in Table 5.3 the cell [Fred, Howard] has value 2 because both perform activities in two cases (case 1 and case 3). Now we have all the information we need to do a Working Together analysis. In Section 5.2 we explain how we compute Agglomerative Hierarchical Clustering using the information extracted from the log file.

Case id | Activity | Event | Originator | Date
1 | Register Claim | start | John | :55:00
1 | Register Claim | complete | John | :59:00
1 | Check all | start | Fred | :56:00
1 | Check all | complete | Fred | :00:00
1 | Evaluate claim | start | Fred | :01:00
1 | Evaluate claim | complete | Fred | :09:00
1 | Send approval letter | start | Robert | :45:00
1 | Send approval letter | complete | Robert | :05:00
1 | Issue payment | start | Howard | :33:00
1 | Issue payment | complete | Howard | :01:00
1 | Archive claim | start | Robert | :56:00
1 | Archive claim | complete | Robert | :56:00
2 | Register Claim | start | Mona | :52:00
2 | Register Claim | complete | Mona | :59:00
2 | Check all | start | Robert | :12:00
2 | Check all | complete | Robert | :56:00
2 | Evaluate claim | start | Fred | :02:00
2 | Evaluate claim | complete | Fred | :39:00
2 | Send rejection letter | start | John | :52:00
2 | Send rejection letter | complete | John | :03:00
2 | Archive claim | start | John | :52:00
2 | Archive claim | complete | John | :59:00
3 | Register Claim | start | Robert | :52:00
3 | Register Claim | complete | Robert | :59:00
3 | Check all | start | Mona | :12:00
3 | Check all | complete | Mona | :33:00
3 | Evaluate claim | start | Fred | :52:00
3 | Evaluate claim | complete | Fred | :12:00
3 | Send approval letter | start | Fred | :12:00
3 | Send approval letter | complete | Fred | :32:00
3 | Issue payment | start | Howard | :52:00
3 | Issue payment | complete | Howard | :09:00
3 | Archive claim | start | Robert | :22:00
3 | Archive claim | complete | Robert | :56:00
4 | Register Claim | start | Robert | :52:00
4 | Register Claim | complete | Robert | :11:00
4 | Check policy only | start | Mona | :32:00
4 | Check policy only | complete | Mona | :59:00
4 | Evaluate claim | start | Linda | :22:00
4 | Evaluate claim | complete | Linda | :47:00
4 | Send approval letter | start | Linda | :52:00
4 | Send approval letter | complete | Linda | :12:00
4 | Issue payment | start | Vincent | :25:00
4 | Issue payment | complete | Vincent | :36:00
4 | Archive claim | start | Mona | :52:00
4 | Archive claim | complete | Mona | :23:00
5 | Register Claim | start | Mona | :52:00
5 | Register Claim | complete | Mona | :27:00
5 | Check policy only | start | Howard | :52:00
5 | Check policy only | complete | Howard | :05:00
5 | Evaluate claim | start | Linda | :17:00
5 | Evaluate claim | complete | Linda | :43:00
5 | Send rejection letter | start | Vincent | :52:00
5 | Send rejection letter | complete | Vincent | :12:00
5 | Issue payment | start | Vincent | :09:00
5 | Issue payment | complete | Vincent | :23:00
5 | Archive claim | start | Mona | :42:00
5 | Archive claim | complete | Mona | :13:00
6 | Register Claim | start | Robert | :43:00
6 | Register Claim | complete | Robert | :06:00
6 | Check all | start | John | :32:00
6 | Check all | complete | John | :13:00
6 | Evaluate claim | start | Linda | :46:00
6 | Evaluate claim | complete | Linda | :57:00
6 | Send rejection letter | start | Linda | :59:00
6 | Send rejection letter | complete | Linda | :01:00
6 | Archive claim | start | Linda | :33:00
6 | Archive claim | complete | Linda | :56:00

Table 5.2: Table representing the information in the insuranceClaimHandlingExample.mxml event log.
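The extraction step just described amounts to counting, for each pair of originators, the number of cases in which both appear. A minimal Java sketch, using the originators of cases 1 and 2 from Table 5.2 as input (the class and variable names are ours, not the plug-in's):

    import java.util.*;

    /** Sketch of the extraction step: count, for each pair of
     *  originators, in how many cases both appear. */
    public class WorkingTogetherSketch {

        public static void main(String[] args) {
            // caseId -> set of originators seen in that case's audit trail entries
            Map<Integer, Set<String>> cases = new HashMap<>();
            cases.put(1, new HashSet<>(Arrays.asList("John", "Fred", "Robert", "Howard")));
            cases.put(2, new HashSet<>(Arrays.asList("Mona", "Robert", "Fred", "John")));

            Map<String, Integer> power = new HashMap<>();  // "A|B" -> number of shared cases
            for (Set<String> originators : cases.values()) {
                List<String> list = new ArrayList<>(originators);
                Collections.sort(list);
                for (int i = 0; i < list.size(); i++)
                    for (int j = i + 1; j < list.size(); j++)
                        power.merge(list.get(i) + "|" + list.get(j), 1, Integer::sum);
            }
            System.out.println(power);  // e.g. Fred|John=2, Fred|Robert=2, John|Robert=2, ...
        }
    }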
        | Fred | Howard | John | Linda | Mona | Robert | Vincent
Fred    |  0   |   2    |  2   |  0    |  2   |   3    |   0
Howard  |  2   |   0    |  1   |  1    |  2   |   2    |   1
John    |  2   |   1    |  0   |  1    |  1   |   3    |   0
Linda   |  0   |   1    |  1   |  0    |  2   |   2    |   2
Mona    |  2   |   2    |  1   |  2    |  0   |   3    |   2
Robert  |  3   |   2    |  3   |  2    |  3   |   0    |   1
Vincent |  0   |   1    |  0   |  2    |  2   |   1    |   0

Table 5.3: Information extracted from the log file (insuranceClaimHandlingExample.mxml). This matrix shows the existing links among vertices.

5.2 Agglomerative Hierarchical Clustering

Our plug-in uses Agglomerative Hierarchical Clustering. The main reason why we chose the agglomerative method instead of the divisive method is that the information stored in event logs refers to single originators, i.e., event logs store the tasks performed by each individual at a specific time. Since we only have information at the performer level, to identify communities we must start from the single individual and proceed with successive agglomerations of individuals until we have a community. Our approach to Agglomerative Hierarchical Clustering has already been explained in detail in Section 4.2.1. We will now clarify some aspects of step 2 of our algorithm. First, it is important not to forget that in our approach the more two individuals work together, the greater the power of the relationship and the smaller the distance between them. Second, in case of a tie, i.e., if in one iteration of the Agglomerative Hierarchical Clustering the algorithm finds more than one pair of clusters with the same similarity, the algorithm agglomerates the last pair found. All hierarchical clustering algorithms have the disadvantage of not giving a hint of how many communities a network should be split into. In the case of the agglomerative approach, the algorithm iterates from one element per cluster to a cluster containing all the elements, and the user does not know which of the several iterations is the best one, the one that matches reality. To address this problem we adopted a concept that has recently emerged, Modularity [19, 20]. This concept also serves to make our tie-breaking process more accurate and precise: instead of choosing the last pair of clusters found, the algorithm can choose the pair that maximizes the modularity value. In the following section we explain this concept. Figures 5.14 and 5.15 are screen-shots of our plug-in showing the outcomes of the Working Together AHC. Our plug-in allows the user to see the social network from two different perspectives: (1) a perspective at the individual level, as we can see in Figure 5.14, and (2) a perspective at the organizational level, as we can see in Figure 5.15. The first perspective, named Social Network, derives a flat model which gives
Figure 5.14: Social Network of the insuranceClaimHandlingExample.mxml. This screen-shot corresponds to the 3rd iteration of the Working Together AHC algorithm, using tie break with modularity. Here the relationships among originators of the social network are represented. Originators from the same cluster are represented by the same colour.

an individual perspective where the user can observe and analyze the relationships between originators and see which originators belong to the same community/cluster. Each originator is depicted as a node in the graph, labelled with the name of the originator. The power of a relationship between two originators is depicted by the edge label, and if two originators are drawn with the same colour then they belong to the same community/cluster. For example, the originators John, Robert and Fred are represented in yellow because all of them belong to the same cluster, Organization Unit 0. However, with this perspective we are not able to analyze the relationships that exist among clusters. We overcome this issue with the second perspective. In the second perspective, named Organizational Network, each node of the social network corresponds to a community/cluster composed of one or more originators, and each cluster has its own colour. As we can see in Figure 5.15, each community/cluster is depicted as a node labelled Organization Unit N. Observe that the members that constitute a cluster have the same colour in the first perspective as the cluster has in the second perspective. For example, John, Robert and Fred are represented in yellow, as is their cluster in the second perspective.
Figure 5.15: Social Network of the insuranceClaimHandlingExample.mxml. This screen-shot represents the organization units at the 3rd iteration of the Working Together AHC algorithm, using tie break with modularity.
5.3 Modularity

Girvan and Newman are two researchers devoted to studying community structures. As discussed in Section 3.3.2, these two researchers developed an algorithm for finding community structures. However, like all other algorithms developed until then, their algorithm did not provide a guide to how many communities a network should be split into. To address this problem they proposed that each division produced by the algorithm should be evaluated using a measure they called Modularity. Each division of the algorithm gives us a set of communities connected among themselves; Modularity, which is a quality measure for graph clustering, determines whether a specific division is good or not. Given the definition of community described earlier, the best solution would seem to be obtained by minimizing the number of edges connecting nodes belonging to different communities (or maximizing the number of edges between nodes belonging to the same community). However, the solution is not so simple, because that criterion is trivially satisfied by no division at all. So Girvan and Newman created a more precise technique, defining a good division into communities as one in which the number of edges between nodes belonging to the same community is significantly greater than the number expected from a random distribution of edges. The division with the highest value of modularity will be the best one, and the one that is most similar to the real network. Next we explain how modularity is calculated.

Definition of Modularity

To help explain how to compute modularity, we make some assumptions:

- Assume a network composed of N vertices connected by m links or edges;
- Let A_ij be an element of the adjacency matrix (a symmetric matrix) of the network, which gives the number of edges between vertices i and j, i.e., the power of the relationship between element i and element j. An example of the adjacency matrix for the insuranceClaimHandlingExample.mxml log file is shown in Table 5.4;
- Finally, suppose we are given a division of the vertices into Nc communities [4].

The modularity of this division is defined to be the fraction of the edges that fall within the given groups minus the expected fraction if edges were distributed at random [20]. The fraction of the edges that fall within the given groups is expressed by A_ij, and the expected number of edges falling between two vertices i and j at random is k_i k_j / 2m, where k_i is the degree of vertex i and k_j is the degree of vertex j. Hence the actual minus the expected number of edges between the same two vertices is given by the following equation:

q = A_ij - k_i k_j / 2m    (5.2)
Summing over all pairs of vertices in the same group, the modularity, denoted Q, is then given by the following equation:

Q = (1/2m) Σ_ij [A_ij - k_i k_j / 2m] δ(c_i, c_j)    (5.3)

where c_i is the group to which vertex i belongs and c_j is the group to which vertex j belongs; δ(c_i, c_j) = 1 if vertex i and vertex j belong to the same cluster, and δ(c_i, c_j) = 0 if they belong to different clusters. The value of the modularity lies in the range [-1, 1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance.

        | Fred | Howard | John | Linda | Mona | Robert | Vincent
Fred    |  0   |   2    |  2   |  0    |  2   |   3    |   0
Howard  |  2   |   0    |  1   |  1    |  2   |   2    |   1
John    |  2   |   1    |  0   |  1    |  1   |   3    |   0
Linda   |  0   |   1    |  1   |  0    |  2   |   2    |   2
Mona    |  2   |   2    |  1   |  2    |  0   |   3    |   2
Robert  |  3   |   2    |  3   |  2    |  3   |   0    |   1
Vincent |  0   |   1    |  0   |  2    |  2   |   1    |   0

Table 5.4: Adjacency matrix of insuranceClaimHandlingExample.mxml used to compute modularity.

We will now explain more precisely how to compute modularity, taking the case from Figure 5.15 as an example. As we can see in Figure 5.14 and Figure 5.15, we have four Organization Units:

- Organic Unit 0 contains: Fred, John, Robert;
- Organic Unit 1 contains: Howard;
- Organic Unit 2 contains: Linda and Vincent;
- Organic Unit 3 contains: Mona.

From this social network we can see that there are two linkages that have the highest weight (weight = 2): one of them is the linkage between Organic Unit 1 and Organic Unit 3, and the other one is the linkage between Organic Unit 2 and Organic Unit 3. As explained in Section 5.2, according to the AHC algorithm we are facing a case of tie. Using the concept of modularity to decide which of the cases to choose, the algorithm will compute the modularity for the two possible cases and will choose the one with the highest value of modularity. For the following calculations, the degree of each originator of the social network is shown in Table 5.5 and the adjacency matrix A_ij is shown in Table 5.4.
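As a companion to Equation 5.3, the following minimal Java sketch computes Q from a weight matrix and a community assignment. The names are ours, not the plug-in's, and, following the worked example below, the i = j terms are omitted from the sum.

    /** Minimal sketch of Equation 5.3 for a symmetric weight matrix
     *  with zero diagonal; community[i] is the cluster of vertex i. */
    public class ModularitySketch {

        public static double compute(double[][] a, int[] community) {
            int n = a.length;
            double twoM = 0;                  // 2m = sum of all weighted degrees
            double[] k = new double[n];       // weighted degree of each vertex
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    k[i] += a[i][j];
                    twoM += a[i][j];
                }
            double q = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    // delta(c_i, c_j): only pairs inside the same community count;
                    // as in the worked example below, the i = j terms are skipped.
                    if (i != j && community[i] == community[j])
                        q += a[i][j] - k[i] * k[j] / twoM;
            return q / twoM;                  // the leading 1/2m factor
        }
    }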
Originator | Degree
Fred | 9
Howard | 9
John | 8
Linda | 8
Mona | 12
Robert | 14
Vincent | 6

Table 5.5: Originators' degrees in the insuranceClaimHandlingExample.mxml social network.

Now we shall consider the first case. Assuming that Organic Unit 1 and Organic Unit 3 are grouped in the same cluster, we would have the following social network:

- Organic Unit 0 contains: Fred, John, Robert;
- Organic Unit 1 contains: Howard and Mona;
- Organic Unit 2 contains: Linda and Vincent.

For this network, with the originators indexed alphabetically (0 = Fred, 1 = Howard, 2 = John, 3 = Linda, 4 = Mona, 5 = Robert, 6 = Vincent), the modularity is computed as follows, where the first bracket covers the pairs inside Organic Unit 0, the second the pair Howard and Mona, and the third the pair Linda and Vincent:

Q = (1/2m) Σ_ij [A_ij - k_i k_j / 2m] δ(c_i, c_j)
  = (1/66) { [(A_02 - k_0 k_2/66) + (A_20 - k_2 k_0/66) + (A_05 - k_0 k_5/66) + (A_50 - k_5 k_0/66) + (A_25 - k_2 k_5/66) + (A_52 - k_5 k_2/66)]
           + [(A_14 - k_1 k_4/66) + (A_41 - k_4 k_1/66)]
           + [(A_36 - k_3 k_6/66) + (A_63 - k_6 k_3/66)] }

Now we shall consider the second case, where Organic Unit 2 and Organic Unit 3 are grouped in the same cluster. The new social network would be:

- Organic Unit 0 contains: Fred, John, Robert;
- Organic Unit 1 contains: Howard;
- Organic Unit 2 contains: Linda, Vincent and Mona.

For this network, modularity is computed as follows:
Q = (1/66) { [(A_02 - k_0 k_2/66) + (A_20 - k_2 k_0/66) + (A_05 - k_0 k_5/66) + (A_50 - k_5 k_0/66) + (A_25 - k_2 k_5/66) + (A_52 - k_5 k_2/66)]
           + [(A_36 - k_3 k_6/66) + (A_63 - k_6 k_3/66) + (A_34 - k_3 k_4/66) + (A_43 - k_4 k_3/66) + (A_64 - k_6 k_4/66) + (A_46 - k_4 k_6/66)] }

As we can see, the first case has the highest value of modularity. So, in this case, the algorithm will choose to group Organic Unit 1 and Organic Unit 3 instead of Organic Unit 2 and Organic Unit 3. Our plug-in displays a chart showing the value of modularity at each iteration (iteration number versus modularity), as we can see in Figure 5.16. On the left side of the panel we indicate which of the iterations has the highest value of modularity; this iteration will be the one that best represents reality.

5.4 Working Together vs. Similar Tasks

The way Agglomerative Hierarchical Clustering (AHC) and modularity are implemented and computed is the same for Working Together and Similar Tasks. The main difference between these two algorithms is that Working Together is based on joint cases while Similar Tasks is based on joint activities. When analyzing the log, these algorithms therefore require different information, which makes the extraction of information from the log file the only stage that differs between them. We have already explained how we process this stage for Working Together, so now we explain how we do it for the Similar Tasks algorithm. The Similar Tasks analysis focuses on the activities that each originator performs. The assumption in this analysis is that people doing similar things have stronger relationships than people doing completely different things. Each individual has a profile based on how frequently they conduct specific activities. From the insuranceClaimHandlingExample log file we extracted the profile of each originator, as we can see in Table 5.6. Since we know the profile of each originator, we can determine which originators perform the same tasks and how many tasks they perform in common. From Table 5.6 we can derive Table 5.7. Each cell of Table 5.7 indicates the power of the relationship between two originators; in this case the power of the relationship is determined by the number of tasks the originators perform in common. The more tasks two individuals perform in common, the
Figure 5.16: Modularity chart.

greater the power of the relationship is. For example, in Table 5.7 the cell [John, Linda] has value 2 because both perform the activities Archive claim and Send rejection letter. Figures 5.17 and 5.18 are screen-shots of the Similar Tasks algorithm showing both perspectives, social and organizational.

        | Archive claim | Check all | Check policy only | Evaluate claim | Issue payment | Register Claim | Send approval letter | Send rejection letter
Fred    | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 0
Howard  | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0
John    | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1
Linda   | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 1
Mona    | 2 | 1 | 1 | 0 | 0 | 2 | 0 | 0
Robert  | 2 | 1 | 0 | 0 | 0 | 3 | 1 | 0
Vincent | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1

Table 5.6: This table shows how many times each originator performs each task.
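The derivation of Table 5.7 from Table 5.6 is a simple set intersection over each originator's task profile. A minimal Java sketch, using the John and Linda profiles from Table 5.6 (the class and variable names are ours, not the plug-in's):

    import java.util.*;

    /** Sketch of the Similar Tasks extraction: the power of the relationship
     *  is the number of distinct tasks two originators both perform. */
    public class SimilarTasksSketch {

        public static void main(String[] args) {
            Map<String, Set<String>> profile = new HashMap<>();
            profile.put("John", new HashSet<>(Arrays.asList(
                    "Archive claim", "Check all", "Register Claim", "Send rejection letter")));
            profile.put("Linda", new HashSet<>(Arrays.asList(
                    "Archive claim", "Evaluate claim", "Send approval letter", "Send rejection letter")));

            Set<String> common = new HashSet<>(profile.get("John"));
            common.retainAll(profile.get("Linda"));  // set intersection
            // Prints 2 and the two shared tasks (Archive claim, Send rejection letter).
            System.out.println(common.size() + " " + common);
        }
    }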
Figure 5.17: Similar Tasks Algorithm - Social Perspective.

Figure 5.18: Similar Tasks Algorithm - Organizational Perspective.
        | Fred | Howard | John | Linda | Mona | Robert | Vincent
Fred    |  0   |   0    |  1   |  2    |  1   |   2    |   0
Howard  |  0   |   0    |  0   |  0    |  1   |   0    |   1
John    |  1   |   0    |  0   |  2    |  3   |   3    |   1
Linda   |  2   |   0    |  2   |  0    |  1   |   2    |   1
Mona    |  1   |   1    |  3   |  1    |  0   |   3    |   0
Robert  |  2   |   0    |  3   |  2    |  3   |   0    |   0
Vincent |  0   |   1    |  1   |  1    |  0   |   0    |   0

Table 5.7: This table shows how many tasks two originators perform in common.

5.5 Conclusion

In this chapter we have explained some basic information that is important to understand how our plug-in computes its results and what kind of information can be derived with these tools. We have explained separately each of the two sociograms that can be derived by this plug-in: Working Together and Similar Tasks.
Chapter 6

Case Study

In this chapter we demonstrate the use of the Organizational Cluster Miner plug-in in a real-world scenario. We present a case study based on the experience at a Portuguese public hospital. The focus of our study was the Emergency Department of the Hospital of São Sebastião.

Hospital of São Sebastião

The Hospital of São Sebastião (HSS), located in the north of the Aveiro district, was built in 1996 but only came into operation later, on January 4. HSS was the first public hospital with private management in the country. For many years, the Portuguese healthcare sector has been struggling to overcome financial difficulties, because the annual state budget for this sector is not enough for hospitals to offer citizens an efficient, fast and high-quality health care service. To overcome this issue many hospitals have privatized their management. Although hospitals continue to be public entities, their management is carried out by private companies. These changes in the public healthcare sector have led to the appearance of Public Entity Business Hospitals, which are public hospitals provided with an innovative management model, using methods, techniques and instruments commonly used by private industry. With this new management model, public hospitals are able to make better use of the annual budget, and step by step this sector is overcoming its financial problems. From the beginning, HSS decided to take advantage of new technologies and, having Portuguese universities and Microsoft as partners, decided to develop an information system fully adapted to its needs. Medtrix EPR is an Electronic Patient Record system which provides doctors and hospital staff with a clear, accurate, single view of each patient, ensuring that the very latest diagnosis and information is always available. The solution is based on Microsoft technologies: Microsoft Visual Studio .NET 2003 and the Microsoft .NET Framework [5]. Medtrix EPR is composed of different modules. To make this case study about the HSS
emergency department, it was necessary to analyze business processes from three different modules: Medtrix-Triage, Medtrix-Emergency and Imatrix (the system that supports the whole Radiology Emergency process). Medtrix EPR stores and integrates all data from all Medtrix modules and other hospital applications in a single database. From this database it was necessary to extract only the data that was valuable and interesting for our study. To do so, a new database was created containing only the data relevant for the study. Based on this new database, event logs were built using a specific component of Medtrix, the Medtrix Process Mining Studio (MPMS), a log preparator. This component allows the user to select specific data from a database and creates a log that can be used by other components of Medtrix, or can be exported to the MXML format. The latter was our case. This way we were able to get three MXML log files containing business processes from the HSS Emergency Department, which we discuss next.

Case Study

To analyze the organizational social network we studied three different logs: (1) a log with 12 days; (2) a log with 14 days and (3) a log with 6 months. All log files were already in MXML format, and in Table 6.8 we can see the characteristics of each file.

Log File | Total originators | Total Process Instances | Total Audit Trails | Total Activities
Log 14 Days | | | |
Log 12 Days | | | |
Log 6 Months | | | |

Table 6.8: Characteristics of the three hospital log files.

Before explaining the outcomes of our algorithms, Similar Tasks and Working Together, it is important to be aware of some characteristics of the Emergency Department. From previous studies done in the same emergency department using social network mining tools, we know that:

1. There are only three types of roles: doctors, nurses, and imaging and scan specialists (the people who perform the medical scans);
2. There are doctors from different specialties: Emergency, Pediatrics, Emergency relay, Obstetrics/Gynecology, Orthopedics, General surgery, Neurology, Internal Medicine, Ophthalmology, ENT (ear, nose and throat), Pediatrics relay, Anesthesiology, Gastroenterology, SAP (Serviço de Atendimento Permanente, a Portuguese hospital service parallel to the Emergency service) and Obstetrics relay;
3. Nurses only perform one type of task: triage;
4. Imaging and scan specialists also only perform one type of task: take a scan;
5. Doctors perform a wide variety of tasks, no matter which specialty they are from.
Now, being aware of this background information, we are able to explain the outcomes we got from our algorithms.

6.1 Similar Tasks

To test the precision of the modularity concept and be sure that the result it indicates is actually the best, we needed to evaluate a scenario where the results were already known. We know from previous studies that:

1. In the Emergency Department there are only three kinds of roles: nurses, doctors, and imaging and scan specialists;
2. Originators with the same role perform similar tasks;
3. Each role has its exclusive set of tasks, i.e., there is no task that is performed in common by originators with different roles.

Knowing this, the Similar Tasks cluster algorithm is expected to identify three distinct groups, each one corresponding to a single role, and it is also expected not to find any kind of relationship among these groups. If the modularity concept is fully working, this scenario should be the output of the iteration with the highest value of modularity.

Results

The results were highly satisfactory. Figure 6.19 and Figure 6.20 show the best and worst outcomes from all linkage rules, with and without tie break. Figure 6.19 shows the best outcome: it returned exactly three communities, each one composed of a single Organization Unit encompassing all originators performing the same role, i.e., Organization Unit 0 is composed only of imaging and scan specialists, Organization Unit 1 is composed only of doctors and Organization Unit 2 is composed only of nurses. Figure 6.20 shows the worst outcome. Although here we also obtained only three communities, they are no longer composed of a single Organization Unit; for example, as we can see in Figure 6.20, the doctors' community is composed of six Organization Units, three of which contain only a single doctor. Despite this, it is significant that all the doctors' Organization Units are connected with each other, and that the doctors' community is isolated from the other two; the situation is similar for the two other communities. Although this is the worst case, the outcome is still very good. Since our algorithm is agglomerative, the last iteration corresponds to a division composed of only three distinct communities, each one corresponding to one of the three roles and without any relationship among them. In this case study the last iteration (the 228th) is the one that corresponds to reality, so if our modularity is determining the best division correctly,
Figure 6.19: Similar Tasks - Modularity Best Case.

the 228th iteration should be the one with the highest value of modularity. As our plug-in gave the highest values of modularity for the iterations between the 218th and the 228th, it is safe to say that our algorithm produced results that are close to reality.

6.2 Working Together

Using the Working Together algorithm we tried two different approaches. The first one consisted in detecting who worked with whom. In the second approach we focused only on originators with the role of doctor and identified which specialties worked together. We will now explain each of the two approaches in detail.

6.2.1 First Approach - Who works with whom?

In this first approach our goal was to identify who worked with whom and to detect significant communities, i.e., a few well-populated communities with sparse links between them. Since we were analyzing a big network composed of many originators over relatively long periods (12 days, 14 days or 6 months), we were hoping to find significant communities (Organization Units composed of many originators) and patterns of behaviour, for example to determine a subset of originators that always work together and to determine how the different communities interact with one another.
Figure 6.20: Similar Tasks - Modularity Worst Case.

Unfortunately the outcomes were far from the ones we expected. The iteration with the highest value of modularity returned as output several communities (Organization Units), each one with only a few elements. The weights of the links between communities ranged within [1, 7], as we can see in Figure 6.21.

Results

Although we were able to detect small communities in the three different logs, these communities were not the same in all of the logs. Let's suppose that in the 12-day log we detected a community composed of DoctorA, DoctorB and NurseA; in the 14-day log these three individuals no longer belong to the same community. Taking into account the context of the logs, we obtained these results due to particular features of this social network that we believe are important to understand. The first reason is that in a hospital emergency department there is a high level of schedule randomness, i.e., people who work in a hospital emergency department do not have a stable schedule and can easily change it. For this reason we cannot find patterns of relationships between originators. The second reason is that, after a nurse finishes the triage, the patient is routed to a doctor, and this doctor remains responsible for the patient until the patient is discharged; the case cannot be passed to another doctor. Consequently, doctors hardly ever work with one another, and the power of their relationships (the frequency with which doctors work together) is relatively low. The cases in which doctors do work together occur because: (1)
there was a crew change; (2) doctors are not working as they should be; (3) there was a specialty change (e.g., a patient is initially routed to an emergency doctor, but this doctor realizes that the patient needs to be routed to a doctor with a specific specialty, like pediatrics). That is why we were not able to detect significant communities, i.e., a social network with few communities, each one well populated, and with sparse linkages between the different communities. Due to these two reasons, the network we analyzed is very sparse and the majority of its ties are weak. Given these characteristics, and according to the strong and weak ties theory discussed earlier, it is understandable why the algorithm could not find significant communities and patterns. Although we were not able to find a stable pattern, we could notice that:

1. Within a group there is always at least one doctor and one nurse.
2. We found a few groups composed of only a couple of originators working together in all three logs. Most likely these originators have a very close relationship (they are friends, married, etc.).
3. If we focus only on nurses, we observe that they rarely work with one another. The few times we see two nurses working together are due to crew changes, when they need to brief the nurse that will replace them. Figure 6.22 shows a matrix that depicts the relationships among nurses during 12 days in the Emergency Department at HSS. This matrix was obtained by selecting only the rows and columns corresponding to nurses from the main matrix (the matrix representing all the relationships between all originators in the emergency department).
4. The previous conclusion is also valid for imaging/scan specialists: if we focus only on imaging/scan specialists, we observe that they rarely work with one another.

6.2.2 Second Approach - Which specialties work together?

In the previous approach we tried to identify relationships at the individual level among nurses, doctors and imaging/scan specialists. Unfortunately the outcomes were not as rich as we expected, due to the high randomness. To overcome this problem we decided to focus only on doctors and find which doctors work together. The reason we focus only on doctors is that they are the only ones who work with one another (nurses never work with nurses, and imaging/scan specialists never work with imaging/scan specialists). If instead we decided to focus only on nurses or imaging/scan specialists, we would obtain a network composed of several communities, all consisting of a single originator and with no relationships among them, i.e., we would only obtain islands and would not be able to draw any conclusions. Since we know from previous studies the specialty of each doctor, by focusing only on doctors our algorithm is able to discover which specialties work with one another.
Figure 6.21: Social network of the event log with 12 days. This is the output of the iteration with the highest modularity of Average Linkage with tie break.

Before applying our algorithm, we preprocessed the log files. First, since we wanted to analyze only data related to the doctors, we excluded all nurses and imaging/scan specialists from the event logs. Second, we excluded all process instances in which only one doctor worked. This way we eliminated all the doctors that always work alone and that originate islands in the social network. These islands would never bring new value to our study, because we are only interested in relationships among communities.

Results

In this second approach we were able to get more stable outcomes. The social networks that we discovered in this case, for all linkage rules, using tie break with modularity or not, were very similar to Figure 6.23. In the figure, each node corresponds to a doctor or a group of doctors, and each colour corresponds to the specialty of that doctor/group. We could gather the following information about each specialty:

Emergency Specialty

- It is the specialty that most often works in groups within the same specialty, i.e., Emergency doctors working with other Emergency doctors; this means that there is a wide range of Emergency doctors that frequently work together on the same case.
Figure 6.22: Matrix from the 12-day log showing relationships among nurses.
Figure 6.23: Social network of the event log with 12 days. This is the output of the iteration with the highest modularity of Complete Linkage with tie break. GREEN = Emergency, BLUE = Pediatrics, PINK = Obstetrics/Gynecology, RED = Orthopedics, ORANGE = Emergency relay, DARK PURPLE = General surgery, LIGHT PURPLE = Neurology and BROWN = Internal Medicine.
- The different communities of doctors are always linked; we do not find communities of emergency doctors isolated from the others, without any kind of connection.
- The size of the communities (Organic Units) ranges between 1 and 30 elements. It is the specialty that forms the biggest communities.
- This specialty works with a wide range of other specialties, almost every specialty in the network.
- This specialty always has one or two communities that play a central role in the social network.

Pediatrics Specialty

- It is the specialty with the second highest tendency to work in groups within the same specialty, i.e., pediatricians working with other pediatricians. We found some communities composed only of pediatricians.
- The different communities composed entirely of pediatricians communicate among themselves.
- This specialty sometimes has the tendency to create islands. We can find a single community of pediatricians isolated from the rest of the social network, as we can see in Figure 6.24, or a small group of pediatric communities that communicate among themselves but are isolated from the rest of the network, as we can see in Figure 6.25. These two situations occurred more frequently in the 12-day log.
- The size of the communities (Organic Units) ranges between 1 and 4 elements, but a size of 4 is very rare.

Obstetrics/Gynecology Specialty

- This specialty often gets isolated from the other communities.
- The size of the communities (Organic Units) ranges between 1 and 2 elements, but a size of 2 is very rare.
- This specialty sometimes has the tendency to create islands. We can find a small group of Obstetrics/Gynecology communities that communicate among themselves but are isolated from the rest of the network, as we can see in Figure 6.23.
- This specialty occasionally forms organization units with elements of the Emergency specialty.

Orthopedics Specialty

- Orthopedics communities are very rare.
- These communities (Organic Units) are composed of only one element.
- These communities always appear at the periphery of the network, as we can see in Figure 6.23.

Emergency relay Specialty

- Emergency relay communities are very rare.
Figure 6.24: Social network of the event log with 14 days. This is the output of the iteration with the highest modularity of Single Linkage.

Figure 6.25: Social network of the event log with 14 days. This is the output of the iteration with the highest modularity of Complete Linkage.
- These communities (Organic Units) are composed of only one element.
- These communities always appear at the periphery of the network, as we can see in Figure 6.23.

General surgery Specialty

- General surgery communities are very rare.
- These communities (Organic Units) are composed of only one element.
- These communities always appear at the periphery of the network, as we can see in Figure 6.23.

Neurology Specialty

- Neurology communities are very rare.
- These communities (Organic Units) are composed of only one element.
- These communities always appear at the periphery of the network.

In this study we were also able to discover that some specialties never work together. In this situation we have:

- Obstetrics/Gynecology & Orthopedics;
- Obstetrics/Gynecology & Pediatrics;
- Orthopedics & Pediatrics;
- General surgery & Pediatrics;
- General surgery & Orthopedics.

6.3 Relationship to the Business Process

So far we have only presented our conclusions about the social network, but with this analysis we can also draw conclusions about the business process, which are illustrated in Figure 6.26. We will now explain these conclusions. Nurses only perform one task (triage); this lets us assume that nurses are the ones who initiate all the processes and that this is their only participation in the whole process, as we can see in Figure 6.26. The communities of the Emergency specialty not only predominate in the network but are also always at the center of the network, playing a central role and being highly connected with several communities, while the other specialties are always on the periphery of the network. This leads us to assume that after the nurses triage the patients, most of them are routed directly to emergency doctors. Only in a few specific cases, probably when the emergency doctors have done everything they could and need the intervention of a more specialized doctor, are patients routed by emergency doctors to doctors of a specific specialty. The existence of community islands, like the ones of the Pediatrics and Obstetrics/Gynecology specialties seen in Figure 6.23, lets us assume that after the triage done by the nurses, a minority of the patients is routed directly to these specialties without passing through emergency doctors. In fact, when we analyze these cases closely, we observe that they refer to entries such as children, who are directly routed to Pediatrics, or pregnant women, who are directly routed to the Obstetrics/Gynecology specialty.
Figure 6.26: Emergency Department Business Process

6.4 Conclusion

This case study proved to be an important test of the capabilities of our approach, demonstrating the value that can be derived from its application in real-world scenarios. The experiments carried out in this case study were intended to demonstrate the use of our approach and to assess the usefulness of its application. To determine whether our algorithm was working as it should, we needed to find out: (1) whether the algorithm was finding communities correctly, and (2) whether the concept of Modularity really worked, i.e., whether it was helpful and precise.

Experiments done with the Working Together plug-in were crucial to prove the first point. A community, as we have already seen, is an aggregate of originators that have some kind of similarity among them. In our case, the similarity between two originators is depicted as the power of the relationship between them: if two individuals are very similar, then they will have a very strong relationship, while between communities we should expect weak ties. In fact, when we observe the outputs of our algorithm we see that the networks conform to the theory of weak and strong ties: our algorithm creates communities inside which we find very strong relationships, while between communities we find weak ties. For example, consider the network of the event log with 12 days, where the power of the relationship ranges between 0 and 54. When we analyze the network with the highest value of modularity, we observe that all ties between communities have a power no higher than 1 and are therefore very weak, while the strongest ties are found within communities (a small code sketch of this check is given at the end of this section).

In the Working Together plug-in two different experiments were carried out, and we were able to obtain completely different information in each of them. This way we were able to show that Working Together is a powerful tool and is not restricted to extracting only one kind of information. Experiments done with the Similar Tasks plug-in were valuable to demonstrate the usefulness of the Modularity concept and to show that this concept can really determine the existing communities. An important and valuable contribution was that with these experiments we were able
not only to derive conclusions and information about the social network, but also to derive and depict a business process based on the information extracted about the social network. This proves that the three different perspectives of process mining are inextricably intertwined.
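As an illustration of how the weak/strong-tie check mentioned above can be carried out, the following sketch compares the strongest tie between communities with the strongest tie within communities, given the relationship matrix and the cluster label of each originator. It is written in Java, the implementation language of the plug-in, but the names are illustrative and the code is not part of the plug-in itself; for the 12-day log discussed above, the expected outcome is an inter-community maximum of at most 1 against intra-community ties reaching 54.

// Sketch of the weak/strong-tie check: compare the strongest tie between
// communities with the strongest tie inside them. `a` is the symmetric
// relationship matrix and community[i] is the cluster label of originator i;
// both names are illustrative.
public class TieStrengthCheck {
    public static void report(double[][] a, int[] community) {
        double maxBetween = 0, maxWithin = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = i + 1; j < a.length; j++)
                if (community[i] == community[j])
                    maxWithin = Math.max(maxWithin, a[i][j]);
                else
                    maxBetween = Math.max(maxBetween, a[i][j]);
        System.out.printf("strongest tie within communities: %.1f%n", maxWithin);
        System.out.printf("strongest tie between communities: %.1f%n", maxBetween);
    }
}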
Chapter 7

Conclusions

Process mining is becoming more and more important with the spread of Process-aware Information Systems and the need to understand the processes performed in an organization. The goal of this area is to use the event logs produced by those systems to extract useful information about business processes. In this dissertation we have studied process mining, with special emphasis on the organizational perspective. We developed a solution capable of dealing with the existing challenges, which arise mainly from the complexity of large networks and the difficulty of representing large amounts of information in a friendly, easy-to-understand way. The contributions made and suggestions for future work are presented in this closing chapter.

7.1 Main Contributions

The focus of this work was to contribute to the development of the organizational perspective of process mining. For a long time the main focus in this perspective was on developing advanced techniques for deriving a flat model and analyzing relationships at the individual level, whereas the study at a higher level, the level that allows us to study which communities exist inside an organization and how they are connected, had been neglected. Thus we decided to contribute by developing a new technique for identifying community structures. This new technique implements Agglomerative Hierarchical Clustering (AHC) in the ProM framework as a plug-in, called the Organizational Miner Cluster. Using the Working Together algorithm of our plug-in, users of ProM are not only able to analyze the relationships between originators that work together on cases, but also to analyze which originators belong to the same team and the relationships among different teams. Using Similar Tasks, users are now also able to identify teams of originators that perform the same tasks.

The technique developed in this work incorporates several important features. One of them is the adoption of the Modularity concept, which until now ProM did not use.
With the Similar Tasks algorithm we were able to prove, in a real-world scenario, that modularity fully works and is very helpful.

One of the big challenges of process mining is to represent a huge amount of information in a friendly and easy way. At the beginning of the implementation we thought that a graphical representation of the network would be enough; however, for very dense, large graphs, no matter how good the drawing tool is, it becomes impractical to manipulate and analyze the graph. We were able to identify this challenge because our real-world case came from the Emergency Department of a medium-sized hospital, where event logs are usually very large. To overcome this issue, we decided that the graphical representation should be supplemented with textual information. We have therefore chosen a matrix representation; this way, when it becomes impossible to analyze the relationships through the graph, the user can resort to the matrix. We have also developed a feature that allows the user to analyze each cluster individually, in both the graphical and the matrix representation. This feature is illustrated by Figures 7.27 and 7.28, which show the sub-network of Organization Unit 0 from Figure 6.21.

Figure 7.27: Matrix view of the sub-network of Organization Unit 0, from the iteration with the highest modularity of Average Linkage with tie break, for the 12-day event log.

With the experiments conducted in a real-world scenario, we showed that with the Organizational Miner Cluster we are capable of dealing with complex event logs and of achieving interesting outcomes that so far would have been too difficult, if not impossible, to obtain in ProM. We were able to fulfill all the goals initially established for this work.

7.2 Future Work

Since this dissertation showed that modularity really works and is a valuable means for social network analysis, it would be valuable, in future work, to extend the modularity concept to other metrics used to analyze social networks. The Organizational Miner Cluster plug-in makes six algorithms available (AHC with single linkage, AHC with average linkage, AHC with complete linkage, and the same three with tie break). It would be interesting to determine the level of similarity between the results of the different algorithms and to identify whether there are groups of originators that work together in all six possible results. If one group is
persistent in all results, then we can be confident that its members really do work together; one possible realization of this check is sketched at the end of this chapter. These improvements can be implemented as additional features of the proposed plug-in. The plug-in will be available in the upcoming ProM v6.0.

Figure 7.28: Graph view of the sub-network of Organization Unit 0, from the iteration with the highest modularity of Average Linkage with tie break, for the 12-day event log.
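As an indication of how the proposed comparison could be realized, the following sketch marks a pair of originators as persistent if the pair ends up in the same cluster in every one of the six results, each result being given as a map from originator to cluster id. This is a sketch of future work, not a feature of the current plug-in, and all names are illustrative.

import java.util.*;

// A pair of originators is "persistent" if it is placed in the same cluster
// in every clustering result. Each result maps originator -> cluster id.
public class PersistenceCheck {
    public static Set<String> persistentPairs(List<Map<String, Integer>> results,
                                              List<String> originators) {
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < originators.size(); i++)
            for (int j = i + 1; j < originators.size(); j++) {
                String a = originators.get(i), b = originators.get(j);
                boolean together = true;
                for (Map<String, Integer> r : results)
                    if (!Objects.equals(r.get(a), r.get(b))) { together = false; break; }
                if (together) pairs.add(a + " & " + b);
            }
        return pairs;
    }
}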
Bibliography

[1] Rozinat, A., Mans, R.S., Song, M. & van der Aalst, W.M.P. (2007). Discovering colored Petri nets from event logs. Int. J. Softw. Tools Technol. Transf., 10.
[2] Boulet, R., Jouve, B., Rossi, F. & Villa, N. (2008). Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing, 71.
[3] Hu, C. & Racherla, P. (2008). Visual representation of knowledge networks: A social network analysis of hospitality research domain. International Journal of Hospitality Management, 27.
[4] Clauset, A., Newman, M.E.J. & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70.
[5] Microsoft Corporation (2006). Hospital moves towards patient-centric healthcare with development tools.
[6] Cross, R. (2001). Knowing what we know: Supporting knowledge creation and sharing in social networks. Organizational Dynamics, 30.
[7] van Dongen, B.F. & van der Aalst, W.M.P. (2005). A meta model for process mining data.
[8] Girvan, M. & Newman, M.E.J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99.
[9] Girvan, M. & Newman, M.E.J. (2004). Finding and evaluating community structure in networks. Physical Review E, 69.
[10] Granovetter, M.S. (1973). The strength of weak ties. The American Journal of Sociology, 78.
[11] Hansen, D. & Shneiderman, B. (2009). Analyzing social media networks: Learning by doing with NodeXL.
[12] Haythornthwaite, C. (1996). Social network analysis: An approach and technique for the study of information exchange. Library and Information Science Research, 18.
[13] Huisman, M. & van Duijn, M.A.J. (2005). Software for social network analysis.
[14] Jamali, M. & Abolhassani, H. (2006). Different aspects of social network analysis. In WI '06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, 66-72, IEEE Computer Society, Washington, DC, USA.
[15] Bozkaya, M., Gabriels, J. & van der Werf, J.M. (2009). Process diagnostics: a method based on process mining.
[16] Moses, N.P. & Boudourides, M.A. (2001). Electronic weak ties in network organisations. In 4th GOR Conference.
[17] Nakatumba, J. & van der Aalst, W.M.P. (2009). Analyzing resource behavior using process mining.
[18] Netjes, M. & Reijers, H.A. (2006). Supporting the BPM life-cycle with FileNet.
[19] Newman, M.E.J. (2004). Detecting community structure in networks. The European Physical Journal B - Condensed Matter and Complex Systems, 38.
[20] Newman, M.E.J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103.
[21] Reijers, H.A., Weijters, A.J.M.M., van Dongen, B.F., Alves de Medeiros, A.K., Song, M. & Verbeek, H.M.W. (2007). Business process mining: An industrial application. Information Systems, 32.
[22] Rinderle, S. & van der Aalst, W.M.P. (2007). Life-cycle support for staff assignment rules in process-aware information systems.
[23] Song, M. & van der Aalst, W.M.P. (2008). Towards comprehensive support for organizational mining. Decision Support Systems, 46.
[24] Song, M. & Günther, C.W. (2008). Trace clustering in process mining. Proceedings of the 4th Workshop on Business Process Intelligence.
[25] Song, M. & van der Aalst, W.M.P. (2004). Mining social networks: Uncovering interaction patterns in business processes. Lecture Notes in Computer Science, vol. 3080, Springer-Verlag.
[26] van der Aalst, W.M.P., Reijers, H.A. & Song, M. (2005). Discovering social networks from event logs. Computer Supported Cooperative Work, 14.
[27] van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M. & van der Aalst, W.M.P. (2005). The ProM framework: A new era in process mining tool support.
[28] Wasserman, S. & Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge University Press, New York, USA.
Appendix A

Log File - insuranceclaimhandlingexample.mxml

The following MXML log was used in Chapter 5 of this dissertation as an example to explain how our new technique for finding community structures was implemented. This log file was created by Anne Rozinat and is available as an example log file in ProM v5.2. (A small sketch showing how such a log can be parsed is given after the listing.)

Listing A.1: insuranceclaimhandlingexample.mxml

<?xml version="1.0" encoding="UTF-8"?>
<WorkflowLog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="WorkflowLog.xsd"
             description="Test log for decision miner">
  <Source program="name:, desc:, data: {program=none}">
    <Data>
      <Attribute name="program">name:, desc:, data: {program=none}</Attribute>
    </Data>
  </Source>
  <Process id="0" description="">
    <ProcessInstance id="Case 1" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:55:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">1000</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">premium</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:59:00</Timestamp><Originator>John</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:56:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:00:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:01:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">approved</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:09:00</Timestamp><Originator>Fred</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:45:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T13:05:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>start</EventType><Timestamp>T13:33:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>complete</EventType><Timestamp>T14:01:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T14:56:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T15:56:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="Case 2" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:52:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">700</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">Normal</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T09:59:00</Timestamp><Originator>Mona</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:12:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:56:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:02:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">rejected</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:39:00</Timestamp><Originator>Fred</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:52:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:03:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:52:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T13:59:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="Case 3" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:52:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">550</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">Normal</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T09:59:00</Timestamp><Originator>Robert</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:12:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:33:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:52:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">approved</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:12:00</Timestamp><Originator>Fred</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:32:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:49:00</Timestamp><Originator>Fred</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:52:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:09:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:22:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:56:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="Case 4" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:52:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">500</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">Normal</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:11:00</Timestamp><Originator>Robert</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check policy only</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:32:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check policy only</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:59:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:22:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">approved</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:47:00</Timestamp><Originator>Linda</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:52:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send approval letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:12:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:25:00</Timestamp><Originator>Vincent</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Issue payment</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:36:00</Timestamp><Originator>Vincent</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:52:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T13:23:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="Case 5" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:52:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">50</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">Normal</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:27:00</Timestamp><Originator>Mona</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check policy only</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:52:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check policy only</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:05:00</Timestamp><Originator>Howard</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T11:17:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">rejected</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T11:43:00</Timestamp><Originator>Linda</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:09:00</Timestamp><Originator>Vincent</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T12:23:00</Timestamp><Originator>Vincent</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T12:42:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T13:13:00</Timestamp><Originator>Mona</Originator></AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="Case 6" description="">
      <AuditTrailEntry><WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T07:43:00</Timestamp><Originator>Robert</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Amount">200</Attribute><Attribute name="CustomerID">C</Attribute><Attribute name="PolicyType">premium</Attribute></Data>
        <WorkflowModelElement>Register Claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T08:06:00</Timestamp><Originator>Robert</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>start</EventType><Timestamp>T08:32:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Check all</WorkflowModelElement><EventType>complete</EventType><Timestamp>T09:13:00</Timestamp><Originator>John</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:46:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry>
        <Data><Attribute name="Status">rejected</Attribute></Data>
        <WorkflowModelElement>Evaluate claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T09:57:00</Timestamp><Originator>Linda</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>start</EventType><Timestamp>T09:59:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Send rejection letter</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:01:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>start</EventType><Timestamp>T10:33:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
      <AuditTrailEntry><WorkflowModelElement>Archive claim</WorkflowModelElement><EventType>complete</EventType><Timestamp>T10:56:00</Timestamp><Originator>Linda</Originator></AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
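To give an idea of how such a log can be consumed outside ProM, the following minimal sketch reads the file with the standard Java DOM API and prints, for each process instance, the set of originators involved, which is precisely the raw material used by the Working Together metric. The file name matches the listing above; everything else is illustrative.

import java.io.File;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// For each ProcessInstance, collect the set of Originators that appear in it.
public class MxmlOriginators {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("insuranceclaimhandlingexample.mxml"));
        NodeList cases = doc.getElementsByTagName("ProcessInstance");
        for (int i = 0; i < cases.getLength(); i++) {
            Element c = (Element) cases.item(i);
            Set<String> originators = new TreeSet<>();
            NodeList names = c.getElementsByTagName("Originator");
            for (int j = 0; j < names.getLength(); j++)
                originators.add(names.item(j).getTextContent());
            System.out.println(c.getAttribute("id") + ": " + originators);
        }
    }
}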
Appendix B

User Manual for the Social Network Mining Plug-in

In this appendix we provide a user manual that presents our plug-in, the Organizational Miner Cluster plug-in, in further detail.
User Manual for the Organizational Miner Cluster plug-in, an organizational mining tool implemented in ProM v6. Cláudia Sofia da Costa Alves, June 2010. UNIVERSIDADE TÉCNICA DE LISBOA, INSTITUTO SUPERIOR TÉCNICO
Contents

1 Introduction
2 Organizational Miner Cluster plug-in
  2.1 Getting Started
  2.2 Organizational Miner Cluster Tabs
    2.2.1 HCA Miner
    2.2.2 Modularity
    2.2.3 Social Network
    2.2.4 Organizational Network
1 Introduction

This document describes how to use the Organizational Miner Cluster plug-in available in ProM v6. The organizational perspective of Process Mining is a valuable technique that allows studying the social network of an organization. To do so, it provides means to evaluate networks by mapping and analyzing relationships among people, teams, departments or even entire organizations. The Organizational Miner Cluster plug-in is an organizational mining tool which derives social networks from event logs generated by Process-aware Information Systems (PAIS). This tool aims to represent a social network not only at the individual level, but also at the organizational level. The individual level derives a flat model, which maps the relationships that exist among the different originators (relationships among people). The organizational level takes this analysis to a higher level of abstraction: it maps the relationships among groups or communities of originators.

The organizational level is achieved using a new technique for identifying community structures in social networks. A community is defined as a set of nodes densely connected with one another; nodes that belong to the same group have a high level of similarity, and different communities are linked by sparse connections. To identify communities, the Organizational Miner Cluster plug-in uses Agglomerative Hierarchical Clustering (HCA), which not only helps identify communities inside the network, but also simplifies the representation and visualization of the large amount of data involved in this kind of analysis. The plug-in also adopts a new concept, Modularity, which is a quality measure for graph clustering: it measures whether a specific division of a network into a group of communities is good or not, in the sense that the connections inside a community should be dense and the connections between communities sparse.

The log file used throughout this manual is insuranceclaimhandlingexample.mxml, which is available with ProM v5.2.
2 Organizational Miner Cluster plug-in

2.1 Getting Started

After loading a log file and choosing the Organizational Miner Cluster plug-in, a welcome panel appears. In this panel the user must choose the initial settings for the social network analysis. Figure 1 shows the initial panel; as we can observe, there are three main settings: 1) Miner Algorithm, 2) Linkage Rule and 3) Tie with Modularity.

1. Miner Algorithm. This option allows the user to choose which kind of analysis to perform. To derive social networks from event logs, different kinds of metrics have been developed: (1) metrics based on (possible) causality, (2) metrics based on joint cases, (3) metrics based on joint activities, and (4) metrics based on special event types [1]. Of these, the Organizational Miner Cluster plug-in supports two (a code sketch of both metrics follows the tie-breaking options below):

The Working Together algorithm is a metric based on joint cases. It simply counts how frequently individuals work on the same cases and maps these relationships; two individuals work together if they perform activities in the same case of an event log.

The Similar Tasks algorithm is a metric based on joint activities. The main idea is to determine who performs the same types of activities. Each individual has a profile based on how frequently he or she carries out specific activities; the profiles are then compared to determine the similarity between them. If two individuals have very similar profiles, it is probable that they work in the same department, for example.

2. Linkage Rule. This option refers to the Agglomerative Hierarchical Clustering algorithm. Linkage rules are the approaches used to compute the distance between two different clusters. This plug-in implements three kinds of linkage rules: a) Single Linkage, b) Complete Linkage, and c) Average Linkage. These approaches are further explained in Section 2.2.1.

3. Tie with Modularity. This option also refers to the Agglomerative Hierarchical Clustering (HCA) algorithm. In each iteration of HCA the algorithm merges the two clusters that are most similar. However, sometimes HCA finds more than one pair of clusters to merge, which is very common in large networks. In this case we are facing a tie, and our algorithm can resolve it in two different ways:
Figure 1: Initial panel

a) the algorithm agglomerates the last pair found;
b) the algorithm uses the modularity concept: given a set of cluster pairs with the same similarity, it calculates the modularity for each possible arrangement of clusters and chooses the one that maximizes the modularity value.
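To make the two metrics concrete, the following sketch shows how both can be computed from a simplified event-log representation (a case id mapped to its list of (originator, task) events). The sketch is written in Java, the language of the ProM framework, but the class and method names are illustrative and are not the ProM API.

import java.util.*;

public class SocialMetrics {

    // Working Together: count how often two originators appear in the same case.
    public static Map<String, Map<String, Integer>> workingTogether(
            Map<String, List<String[]>> log) {  // caseId -> list of {originator, task}
        Map<String, Map<String, Integer>> w = new HashMap<>();
        for (List<String[]> events : log.values()) {
            Set<String> originators = new TreeSet<>();
            for (String[] e : events) originators.add(e[0]);
            for (String a : originators)
                for (String b : originators)
                    if (a.compareTo(b) < 0)                // count each pair once
                        w.computeIfAbsent(a, k -> new HashMap<>())
                         .merge(b, 1, Integer::sum);
        }
        return w;
    }

    // Similar Tasks: build a task-frequency profile per originator; similarity
    // is then obtained by comparing these profiles.
    public static Map<String, Map<String, Integer>> taskProfiles(
            Map<String, List<String[]>> log) {
        Map<String, Map<String, Integer>> p = new HashMap<>();
        for (List<String[]> events : log.values())
            for (String[] e : events)
                p.computeIfAbsent(e[0], k -> new HashMap<>())
                 .merge(e[1], 1, Integer::sum);
        return p;
    }
}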
2.2 Organizational Miner Cluster Tabs

The Organizational Miner Cluster plug-in is composed of four tabs: tab 1 - HCA Algorithm, tab 2 - Modularity, tab 3 - Social Network and tab 4 - Organizational Network. Each tab deals with different and specific information, as will be discussed in the following subsections. For each tab, we first give some theoretical notions and then explain its features and settings.

2.2.1 HCA Miner

Our plug-in uses Agglomerative Hierarchical Clustering, which means that it starts from single individuals and proceeds by successive agglomeration of individuals until communities are found.

HCA Algorithm. Given a network with a set of N nodes, the basic process of the agglomerative hierarchical clustering algorithm adopted is as follows:

1. Each node is assigned to a cluster (if there are N nodes, there will be N clusters, each containing just one item). In this step the distances (similarities) among clusters correspond to the power of the relationships of the nodes they contain.

2. The algorithm then searches for the most powerful relationship between two clusters and merges them into a single cluster, so that there is now one cluster less. The similarity between two individuals is computed differently in the Working Together and Similar Tasks metrics. In Working Together, the more often two individuals work together, the greater the power of the relationship between them and the greater their similarity. In Similar Tasks, the more tasks two individuals perform in common, the greater the power of the relationship between them and the greater their similarity.

a) If there are several candidates, i.e., more than one pair of clusters with the most powerful relationship, the decision of which candidates to agglomerate is made according to one of two options: (1) the algorithm chooses the last pair of clusters found, or (2) the algorithm chooses the pair of clusters that maximizes the modularity.

3. Compute the distances (similarities) between the new cluster and each of the old clusters. For this step the algorithm may use one of three methods: single linkage, complete linkage or average linkage.

4. Determine the value of modularity for this number of clusters.

5. Repeat steps 2, 3 and 4 until all items are clustered into a single cluster of size N.

Step 3, the distance between two clusters, can be computed in different ways in our approach (a code sketch of the clustering loop and of the three linkage rules is given after the feature list below):
Single Linkage. Single linkage, also known as the nearest-neighbour technique, defines the similarity between two clusters as the distance between the closest pair of elements of those clusters. In other words, the distance between two clusters is given by the value of the shortest link between them. In the single linkage method, D(r, s) is computed as

D(r, s) = min d(i, j), where element i is in cluster r and element j is in cluster s.

Here the distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s; the minimum of these distances is taken as the distance between clusters r and s. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.

Complete Linkage. This method is similar to the previous one, but instead of considering the minimum value it considers the maximum value: complete linkage computes the distance between two clusters as the distance between their two most distant elements.

Average Linkage. Here the distance between two clusters is defined as the average distance from any member of one cluster to any member of the other cluster. In the average linkage method, D(r, s) is computed as

D(r, s) = T_rs / (N_r x N_s),

where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s, respectively. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.

HCA features. Figure 2 shows the HCA Algorithm tab, which is composed of three main panels. The panel on the right-hand side gives an overview of executed plug-ins; in the figure we can see that the Organizational Miner Cluster plug-in has been executed. The panel in the middle, depicted in Figure 3, shows the output of the HCA algorithm as a tree. The root node, coloured orange, represents the HCA algorithm; the child nodes in blue are the iterations of the HCA algorithm, and the child nodes in green are the clusters achieved in each iteration. New clusters are presented in uppercase. For example, observe the node corresponding to the 3rd iteration: Organic Unit 1 of this iteration is represented in uppercase because it results from the agglomeration of Organic Unit 1 and Organic Unit 4 of the 2nd iteration. Finally, the panel on the left-hand side is the settings panel, where the user can configure the HCA algorithm and where some exploration tools are available.

We will now explain the panel on the left-hand side. The HCA tab is the first one available and is used as the principal one. If the user wants to change the settings initially chosen
in the main panel, he is able to do that here. Figure 4 shows the settings available in this tab, which we now explain:

1. Social Metric - here the user is able to choose between the Working Together metric and the Similar Tasks metric;

2. Linkage Rule - this setting allows the user to choose one of the three linkage rules available: Single Linkage, Average Linkage and Complete Linkage;

3. Tie break with modularity - this setting determines how the HCA algorithm decides which candidates to choose in case of a tie. If this setting is selected, the HCA algorithm chooses the candidates that maximize modularity; otherwise it chooses the last candidates found;

4. Show sub-network - this setting allows the user to analyze a sub-network, i.e., given a set of clusters, the user can choose one in particular and analyze it individually, together with the nodes that compose it. For example, suppose we want to analyze Organic Unit 1 of the fourth iteration. Figure 5 shows the result; as we can see, this functionality gives us the output in two different ways: (1) as a matrix, where the value of each cell corresponds to the power of the relationship, or (2) as a graph, where the power of the relationship is depicted as the link label;

5. Button Collapse All - collapses all the nodes of the tree shown in the middle panel and shows only the children at level 1;

6. Button Expand All - expands all the nodes of the tree shown in the middle panel;

7. Button Calculate - runs the HCA algorithm according to the options selected in the settings panel.
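The following sketch illustrates the clustering loop of steps 1-5 together with the three linkage rules. It is phrased on a similarity matrix (the power of the relationship, assumed non-negative) rather than on a distance matrix, so single linkage takes the strongest tie between two clusters, complete linkage the weakest, and average linkage the mean; modularity computation and tie breaking are omitted for brevity. All names are illustrative; this is a simplified sketch, not the plug-in's actual code.

import java.util.*;

public class HcaSketch {

    enum Linkage { SINGLE, COMPLETE, AVERAGE }

    // Cluster-to-cluster similarity under the chosen linkage rule.
    static double linkage(List<Integer> a, List<Integer> b,
                          double[][] sim, Linkage rule) {
        double best = (rule == Linkage.COMPLETE) ? Double.MAX_VALUE : 0;
        double sum = 0;
        for (int i : a) for (int j : b) {
            double s = sim[i][j];
            sum += s;
            if (rule == Linkage.SINGLE)   best = Math.max(best, s); // strongest tie
            if (rule == Linkage.COMPLETE) best = Math.min(best, s); // weakest tie
        }
        return rule == Linkage.AVERAGE ? sum / (a.size() * b.size()) : best;
    }

    // Returns the cluster configuration after each iteration.
    static List<List<List<Integer>>> cluster(double[][] sim, Linkage rule) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++)
            clusters.add(new ArrayList<>(List.of(i)));     // step 1: one node per cluster
        List<List<List<Integer>>> iterations = new ArrayList<>();
        while (clusters.size() > 1) {
            int bi = 0, bj = 1;
            double best = -1;
            for (int i = 0; i < clusters.size(); i++)      // step 2: most similar pair
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = linkage(clusters.get(i), clusters.get(j), sim, rule);
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));  // merge the winning pair
            List<List<Integer>> snapshot = new ArrayList<>();
            for (List<Integer> c : clusters) snapshot.add(new ArrayList<>(c));
            iterations.add(snapshot);                      // steps 3-5: record and repeat
        }
        return iterations;
    }
}

This naive loop recomputes the linkage for every pair in every iteration; for very large networks, more efficient community-detection variants such as the one of Clauset et al. [4] would be preferable.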
Figure 2: HCA tree

Figure 3: HCA tab - middle panel
Figure 4: HCA tree settings

Figure 5: HCA sub-network analysis
2.2.2 Modularity

Modularity is a quality measure for graph clustering. It measures whether a specific division of a network into a group of communities is good or not, in the sense that the connections inside a community should be dense and the connections among communities sparse. The need for this concept arises from one of the most serious handicaps of hierarchical clustering algorithms: they provide no guidance on how many communities a network should be split into. For example, in an agglomerative approach the algorithm iterates from one element per cluster to a single cluster containing all the elements, and the user does not know which of the several iterations is the best one, the one that matches reality. To address this problem we adopted the Modularity concept, which has recently emerged [2, 3]. Next we explain how modularity is calculated.

Definition of Modularity. To help explain how to compute modularity, we make some assumptions:

- Assume a network composed of N vertices connected by m links or edges;
- Let A_ij be an element of the adjacency matrix (a symmetric matrix) of the network, which gives the number of edges between vertices i and j, i.e., the power of the relationship between element i and element j;
- Finally, suppose we are given a candidate division of the vertices into N_c communities [4].

The modularity of this division is defined as the fraction of the edges that fall within the given groups minus the expected such fraction if the edges were distributed at random. The fraction of the edges that fall within the given groups is expressed by A_ij, and the expected number of edges falling between two vertices i and j at random is k_i k_j / 2m, where k_i is the degree of vertex i and k_j is the degree of vertex j. Hence the actual minus the expected number of edges between the same two vertices is given by the following equation:

    q_{ij} = A_{ij} - \frac{k_i k_j}{2m}    (1)

Summing over all pairs of vertices in the same group, the modularity, denoted Q, is given by the following equation:

    Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)    (2)

where c_i is the group to which vertex i belongs and c_j is the group to which vertex j belongs; \delta(c_i, c_j) = 1 if vertices i and j belong to the same cluster, and \delta(c_i, c_j) = 0 if they belong to different clusters. The value of the modularity lies in the range [-1, 1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
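The following sketch is a direct transcription of Equation (2): it assumes a symmetric (possibly weighted) adjacency matrix and a community label per vertex, and is meant only to make the computation concrete; the names are illustrative.

// Q = (1/2m) * sum_ij [ A_ij - k_i*k_j/(2m) ] * delta(c_i, c_j)
public class Modularity {
    public static double q(double[][] a, int[] community) {
        int n = a.length;
        double[] k = new double[n];   // (weighted) degree of each vertex
        double twoM = 0;              // 2m = sum of all matrix entries (A symmetric)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { k[i] += a[i][j]; twoM += a[i][j]; }
        double q = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (community[i] == community[j])          // delta(c_i, c_j) = 1
                    q += a[i][j] - k[i] * k[j] / twoM;
        return q / twoM;
    }
}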
Figure 6: Modularity chart

Modularity features. We will now explain the features available in the Modularity tab. In the settings panel there are three main features, which are merely informative:

1. Social metric - indicates the metric that was selected in the main panel or in the HCA tab. Note that the user is not able to change this feature in this tab.
2. Linkage rule - indicates the linkage rule that was selected in the main panel or in the HCA tab. Note that the user is not able to change this feature in this tab.
3. Information - indicates which iteration has the highest modularity value, and reports that value.
Figure 7: Social Network tab

2.2.3 Social Network

The Social Network tab offers the user an individual perspective. This functionality of the plug-in derives a flat model from the event log, in which the user can analyze how the individuals of the network are connected to one another. Figure 7 shows the Social Network tab; as we can see, the network is depicted as a graph, where each node represents an individual and the links connecting nodes represent the relationships that exist between them. Each node is represented by one colour, and nodes from the same cluster are represented by the same colour. For example, in Figure 7 we can see the social network of the 2nd iteration of the HCA algorithm; the nodes Linda and Vincent are both represented in orange because they belong to the same cluster. This perspective allows us to see the relationships among originators and which of them belong to the same clusters. However, it does not allow us to analyze the relationships among clusters. We overcome this issue with a second perspective, the Organizational Network, which is available in the fourth and last tab.

Social Network features. Like the other tabs, this one is composed of three panels. We will explain the settings panel, represented in Figure 8, and how each feature works.

1. Social metric - indicates which metric was selected in the main panel or in the HCA tab. Note that the user is not able to change this feature in this tab.
2. Linkage rule - indicates which linkage rule was selected in the main panel or in the HCA tab. Note that the user is not able to change this feature in this tab.
3. Layout - this feature offers five algorithms used to draw undirected graphs. These algorithms calculate the layout of a graph using only information contained
within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs. The five algorithms used in this plug-in are listed below (a self-contained sketch of the force-directed idea behind them is given after Figure 9):

- KKLayout: also known as the Kamada-Kawai layout, it attempts to position the nodes so that the geometric (Euclidean) distance among them is as close as possible to the graph-theoretic (path) distance among them.
- CircleLayout: arranges all the nodes randomly in a circle, with constant spacing between neighbouring nodes.
- FRLayout: also known as the Fruchterman-Reingold layout, a force-directed layout.
- SpringLayout: a force-directed layout algorithm designed to simulate a system of particles, each with some mass. The vertices simulate mass points repelling each other, and the edges simulate springs with attracting forces.
- ISOMLayout: a layout for self-organizing graphs. This is a neural network technique that arranges the data according to a low-dimensional structure. The original data is partitioned into as many homogeneous clusters as there are units, in such a way that close clusters contain close data points in the original space.

4. Mouse Mode - this feature has two options: (1) Transforming, which drags the whole graph around the screen, and (2) Picking, which drags only a specific node.
5. HCA Algorithm Iteration - shows the social network that corresponds to the iteration of the HCA algorithm selected with the slide bar.
6. Remove Edges - with the slide bar the user can choose a threshold; all the edges with a weight above the threshold are not drawn.
7. View Options:
- show edge weights: if checked, edges are labelled with their weights; otherwise, edges are not labelled.
- stroke highlight on selection: if checked, edges are drawn using thick solid lines, with the thickness of each line proportional to its weight: the bigger the weight, the thicker the line.
- group clusters: if checked, nodes belonging to the same cluster are drawn close to each other, as we can see in Figure 9.
Figure 8: Social Network settings

Figure 9: Group clusters
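To convey the force-directed idea behind FRLayout and SpringLayout, the following self-contained sketch implements a simplified version of the Fruchterman-Reingold scheme: all vertex pairs repel with force k^2/d while edges attract with force d^2/k, and a cooling schedule limits the displacement per iteration. This illustrates the principle only; it is not the code of the layout library used by the plug-in, and all names are illustrative.

import java.util.Random;

public class ForceDirectedSketch {
    // Places the vertices of an undirected graph (given as a boolean adjacency
    // matrix) in the unit square using repulsive and attractive forces.
    public static double[][] layout(boolean[][] edge, int iterations) {
        int n = edge.length;
        double k = Math.sqrt(1.0 / n);                     // ideal edge length
        double[][] pos = new double[n][2];
        Random rnd = new Random(42);
        for (double[] p : pos) { p[0] = rnd.nextDouble(); p[1] = rnd.nextDouble(); }
        for (int it = 0; it < iterations; it++) {
            double[][] disp = new double[n][2];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double dx = pos[i][0] - pos[j][0], dy = pos[i][1] - pos[j][1];
                    double d = Math.max(1e-9, Math.hypot(dx, dy));
                    double f = (k * k) / d;                // repulsion between all pairs
                    if (edge[i][j]) f -= (d * d) / k;      // attraction along edges
                    disp[i][0] += dx / d * f;
                    disp[i][1] += dy / d * f;
                }
            double temp = 0.1 * (1.0 - (double) it / iterations); // cooling schedule
            for (int i = 0; i < n; i++) {
                double d = Math.max(1e-9, Math.hypot(disp[i][0], disp[i][1]));
                pos[i][0] += disp[i][0] / d * Math.min(d, temp);
                pos[i][1] += disp[i][1] / d * Math.min(d, temp);
            }
        }
        return pos;
    }
}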
Figure 10: Organizational Network tab

2.2.4 Organizational Network

This perspective allows the user to identify teams in the social network and to analyze the relationships that exist among those teams. Figure 10 shows the Organizational Network tab; as we can see, the network is depicted as a graph, where each node represents one team and the links connecting nodes represent the relationships that exist among them. Figure 10 corresponds to the organizational perspective of Figure 7. Notice that nodes from the same cluster in the social perspective have the same colour as their cluster in the organizational perspective; for example, Linda and Vincent are represented in orange, as is their cluster in the organizational perspective. The features of this tab are similar to the features available in the Social Network tab.

References

[1] W.M.P. van der Aalst, M. Song. Mining social networks: Uncovering interaction patterns in business processes. In: J. Desel, B. Pernici, M. Weske (Eds.), International Conference on Business Process Management (BPM 2004), Lecture Notes in Computer Science, vol. 3080, Springer, Berlin, 2004.
[2] M.E.J. Newman. Detecting community structure in networks. The European Physical Journal B, 38 (2004).
[3] M.E.J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences (USA), 103 (2006).
[4] A. Clauset, M.E.J. Newman, C. Moore. Finding community structure in very large networks. Physical Review E, 70, 066111 (2004).
Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
Ultimus and Microsoft Active Directory
Ultimus and Microsoft Active Directory May 2004 Ultimus, Incorporated 15200 Weston Parkway, Suite 106 Cary, North Carolina 27513 Phone: (919) 678-0900 Fax: (919) 678-0901 E-mail: [email protected]
Data Science. Research Theme: Process Mining
Data Science Research Theme: Process Mining Process mining is a relatively young research discipline that sits between computational intelligence and data mining on the one hand and process modeling and
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Data Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
Network Analysis of a Large Scale Open Source Project
2014 40th Euromicro Conference on Software Engineering and Advanced Applications Network Analysis of a Large Scale Open Source Project Alma Oručević-Alagić, Martin Höst Department of Computer Science,
THREE-DIMENSIONAL CARTOGRAPHIC REPRESENTATION AND VISUALIZATION FOR SOCIAL NETWORK SPATIAL ANALYSIS
CO-205 THREE-DIMENSIONAL CARTOGRAPHIC REPRESENTATION AND VISUALIZATION FOR SOCIAL NETWORK SPATIAL ANALYSIS SLUTER C.R.(1), IESCHECK A.L.(2), DELAZARI L.S.(1), BRANDALIZE M.C.B.(1) (1) Universidade Federal
Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals
Chapter 29 Scale-Free Network Topologies with Clustering Similar to Online Social Networks
Chapter 29 Scale-Free Network Topologies with Clustering Similar to Online Social Networks Imre Varga Abstract In this paper I propose a novel method to model real online social networks where the growing
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected]
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Chapter 12 Analyzing Spaghetti Processes
Chapter 12 Analyzing Spaghetti Processes prof.dr.ir. Wil van der Aalst www.processmining.org Overview Chapter 1 Introduction Part I: Preliminaries Chapter 2 Process Modeling and Analysis Chapter 3 Data
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland
A comparative study of social network analysis tools
Membre de Membre de A comparative study of social network analysis tools David Combe, Christine Largeron, Előd Egyed-Zsigmond and Mathias Géry International Workshop on Web Intelligence and Virtual Enterprises
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms Kirill Krinkin Open Source and Linux lab Saint Petersburg, Russia [email protected] Eugene Kalishenko Saint Petersburg
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Generation of a Set of Event Logs with Noise
Generation of a Set of Event Logs with Noise Ivan Shugurov International Laboratory of Process-Aware Information Systems National Research University Higher School of Economics 33 Kirpichnaya Str., Moscow,
Cluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
How To Find Influence Between Two Concepts In A Network
2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing
Strong and Weak Ties
Strong and Weak Ties Web Science (VU) (707.000) Elisabeth Lex KTI, TU Graz April 11, 2016 Elisabeth Lex (KTI, TU Graz) Networks April 11, 2016 1 / 66 Outline 1 Repetition 2 Strong and Weak Ties 3 General
Feature. Applications of Business Process Analytics and Mining for Internal Control. World
Feature Filip Caron is a doctoral researcher in the Department of Decision Sciences and Information Management, Information Systems Group, at the Katholieke Universiteit Leuven (Flanders, Belgium). Jan
KNOWLEDGE DISCOVERY FOR SUPPLY CHAIN MANAGEMENT SYSTEMS: A SCHEMA COMPOSITION APPROACH
KNOWLEDGE DISCOVERY FOR SUPPLY CHAIN MANAGEMENT SYSTEMS: A SCHEMA COMPOSITION APPROACH Shi-Ming Huang and Tsuei-Chun Hu* Department of Accounting and Information Technology *Department of Information Management
CS2310 Final project report
CS2310 Final project report I- card and C- card management system on Android Mengmeng Li ([email protected]) Abstract With the proliferation and rapidly spread of Android mobile phones, it deserves to mobilize
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
In this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
Social Media Mining. Graph Essentials
Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Human-Readable BPMN Diagrams
Human-Readable BPMN Diagrams Refactoring OMG s E-Mail Voting Example Thomas Allweyer V 1.1 1 The E-Mail Voting Process Model The Object Management Group (OMG) has published a useful non-normative document
Database Marketing, Business Intelligence and Knowledge Discovery
Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski
Self-Service Active Directory Group Management
Self-Service Active Directory Group Management 2015 Hitachi ID Systems, Inc. All rights reserved. Hitachi ID Group Manager is a self-service group membership request portal. It allows users to request
Chapter 4 Getting the Data
Chapter 4 Getting the Data prof.dr.ir. Wil van der Aalst www.processmining.org Overview Chapter 1 Introduction Part I: Preliminaries Chapter 2 Process Modeling and Analysis Chapter 3 Data Mining Part II:
Compact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
Data Mining in Telecommunication
Data Mining in Telecommunication Mohsin Nadaf & Vidya Kadam Department of IT, Trinity College of Engineering & Research, Pune, India E-mail : [email protected] Abstract Telecommunication is one of
Turning Emergency Plans into Executable
Turning Emergency Plans into Executable Artifacts José H. Canós-Cerdá, Juan Sánchez-Díaz, Vicent Orts, Mª Carmen Penadés ISSI-DSIC Universitat Politècnica de València, Spain {jhcanos jsanchez mpenades}@dsic.upv.es
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
Clustering & Visualization
Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Modeling Guidelines Manual
Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe [email protected] Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.
Software Metrics & Software Metrology. Alain Abran. Chapter 4 Quantification and Measurement are Not the Same!
Software Metrics & Software Metrology Alain Abran Chapter 4 Quantification and Measurement are Not the Same! 1 Agenda This chapter covers: The difference between a number & an analysis model. The Measurement
an introduction to VISUALIZING DATA by joel laumans
an introduction to VISUALIZING DATA by joel laumans an introduction to VISUALIZING DATA iii AN INTRODUCTION TO VISUALIZING DATA by Joel Laumans Table of Contents 1 Introduction 1 Definition Purpose 2 Data
Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing
Investigating Clinical Care Pathways Correlated with Outcomes
Investigating Clinical Care Pathways Correlated with Outcomes Geetika T. Lakshmanan, Szabolcs Rozsnyai, Fei Wang IBM T. J. Watson Research Center, NY, USA August 2013 Outline Care Pathways Typical Challenges
Open Source Software Developer and Project Networks
Open Source Software Developer and Project Networks Matthew Van Antwerp and Greg Madey University of Notre Dame {mvanantw,gmadey}@cse.nd.edu Abstract. This paper outlines complex network concepts and how
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Handling Big(ger) Logs: Connecting ProM 6 to Apache Hadoop
Handling Big(ger) Logs: Connecting ProM 6 to Apache Hadoop Sergio Hernández 1, S.J. van Zelst 2, Joaquín Ezpeleta 1, and Wil M.P. van der Aalst 2 1 Department of Computer Science and Systems Engineering
THE ROLE OF SOCIOGRAMS IN SOCIAL NETWORK ANALYSIS. Maryann Durland Ph.D. EERS Conference 2012 Monday April 20, 10:30-12:00
THE ROLE OF SOCIOGRAMS IN SOCIAL NETWORK ANALYSIS Maryann Durland Ph.D. EERS Conference 2012 Monday April 20, 10:30-12:00 FORMAT OF PRESENTATION Part I SNA overview 10 minutes Part II Sociograms Example
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Mining Social-Network Graphs
342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is
Component visualization methods for large legacy software in C/C++
Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University [email protected]
