Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen


Content
Data Mining part: Lecture 1, Lecture 2, Lecture 3, Lecture 4
Process mining part: Lecture 5, Lecture 6, Lecture 7

Data Mining part

Lecture 1

Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. It is at the intersection of artificial intelligence, machine learning and database systems.

Data mining processes:
1. Classification: A method to identify to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
2. Prediction/estimation: Prediction methods are very similar to classification, but they try to predict the value of a numerical variable rather than a class.
3. Association rules: A data mining method for discovering relevant relations among variables contained in large databases.
4. Clustering: A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Supervised method: Uses data sets in which the value of the outcome of interest is known (classification & prediction).
Unsupervised method: Uses data sets in which the value of the outcome of interest is unknown (association rules & clustering).

Data types:
Training data: The set of data from which a classification or prediction algorithm learns about the relationships between the input variables and the outcome variable.
Validation data: Once the algorithm has learned from the training data, it is applied to this sample of data (where the outcome is known) to see how well it does in comparison to other models.
Test data: If many data mining models are used, it is prudent to save a third sample of data with known outcomes, to use with the finally selected model to predict how well it will do.

Knowledge discovery: When one does not know whether the information is there, but has the means to analyze the data. (Typical difficulties: the importance of attributes is unclear, there is too much data, the data is polluted, results make no sense.)

K-Nearest neighbors classification method: Identifies the k observations in the training set that are most similar to a new record that we wish to classify, and uses these similar records to classify the new record. It assigns the new record to the predominant class among these neighbors. If (x1, x2, ..., xp) are the predictor values of the new record, the algorithm looks for the records in the training data that are most similar (nearest) to (x1, x2, ..., xp) in terms of the Euclidean distance. The value of k determines how many nearest records are used as neighbors. If k is too small, the approach can be too sensitive to noise. If k is too large, the record to be classified can be put in a wrong class.
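The k-nearest-neighbors procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `euclidean` and `knn_classify` and the toy training data are made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equally long numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_X, train_y, new_x, k=3):
    """Assign new_x to the predominant class among its k nearest training records."""
    # Sort the training records by their distance to the new record, keep the k nearest.
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], new_x))[:k]
    # Majority vote among the neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well separated groups of records.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_X, train_y, (1.1, 1.0), k=3))  # prints A
```

With k=1 the result depends on a single neighbor, which is where the sensitivity to noise comes from; with k equal to the size of the training set the method reduces to a majority vote over all records.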

Naïve Rule classification method: Classifies a new record as a member of the majority class in the current training set, without taking information on the input variables (predictors) into account. It is used as a baseline for evaluating the performance of more complicated classifiers.

Bayes Classification method: Classifies based on the computed probability of a record belonging to a given class, using not only the prevalence of that class but also additional information on that record. It works only with predictors that are categorical (not numerical). It is based on the concept of conditional probability: $P(H \mid E) = P(H \cap E) / P(E)$ and on Bayes' theorem: $P(H \mid E) = P(E \mid H)\,P(H) / P(E)$, where H is a hypothesis to be tested and E is the evidence associated with the hypothesis. From the classification point of view, H is the predicted class and E represents the values of the input variables (predictors). $P(H \mid E)$ is the conditional probability that H is true given evidence E. $P(H)$ is the so-called a priori probability, denoting the probability of the hypothesis before the presentation of any evidence.

A significant problem with Bayes classifiers arises when one of the counts for an attribute value is zero. To avoid a zero probability, the raw count ratio is replaced by a corrected ratio with two parameters: k, a value between 0 and 1, and p, chosen as an equal fraction of the total number of possible values for the attribute. If the attribute can assume two values, then p = 0.5. Another problem is missing data: a missing value is simply not considered in the computation.

Example: determine the sex based on magazine promotion = yes, watch promotion = yes, life insurance = no, credit card insurance = no. To determine the class, first build a pivot table and then calculate the probabilities.

Figure 1: given information
Figure 2: pivot table
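The zero-count correction and the class computation can be illustrated with a small sketch. Several things here are assumptions, because the corrected formula itself is not reproduced in the text: the correction is taken to be the commonly used ratio (n + k·p) / (d + k), where n is the count for the attribute value within the class and d is the class count; the attributes are treated as conditionally independent given the class; and the records below are invented toy data, not the data of Figure 1. The names `smoothed_ratio` and `bayes_scores` are hypothetical.

```python
from collections import Counter


def smoothed_ratio(n, d, k=1.0, p=0.5):
    """Corrected ratio (n + k*p) / (d + k); assumed form of the zero-count fix."""
    return (n + k * p) / (d + k)


def bayes_scores(records, labels, new_record, k=1.0):
    """Return normalized class probabilities for new_record (categorical attributes only)."""
    class_counts = Counter(labels)
    scores = {}
    for cls, d in class_counts.items():
        score = d / len(labels)  # a priori probability P(H)
        for attr, value in new_record.items():
            # p is an equal fraction of the number of possible attribute values.
            p = 1.0 / len({r[attr] for r in records})
            # n counts the records of this class that have the given attribute value.
            n = sum(1 for r, y in zip(records, labels)
                    if y == cls and r[attr] == value)
            score *= smoothed_ratio(n, d, k, p)  # assumes independent attributes
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Invented toy data with a single yes/no attribute.
records = [{"watch_promo": "yes"}, {"watch_promo": "yes"},
           {"watch_promo": "no"}, {"watch_promo": "no"}]
labels = ["male", "male", "female", "male"]

probs = bayes_scores(records, labels, {"watch_promo": "yes"})
print(max(probs, key=probs.get))  # prints male
```

Without the correction, the class "female" would get probability 0 here because no female record has watch_promo = yes; with it, the class keeps a small non-zero probability.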

Lecture 2

Figure 3: industry standard data mining framework

A model is built up from:
Structure: variables, inputs, outputs and the types of relations among them
Parameters: free variables that remain after a structure is selected
Search method: the method with which the optimal parameters are identified

Scoring functions (scoring function, and how it is computed):

- Sum of squared errors
- Mean absolute error
- Mean squared error
- Root mean squared error
- Variance accounted for
- Classification confusion matrix
- Estimated misclassification rate (err)
- Accuracy: accuracy = 1 - err

Classification in unequal classes:
Sensitivity: the ability of a classifier to classify the members of C0 correctly
Specificity: the ability of a classifier to rule out the members of C1 correctly
False positive rate: the percentage of C1 members erroneously classified in C0
False negative rate: the percentage of C0 members erroneously classified in C1

Origin of Errors:
Experimental errors: inherent to the data, due to noise, the method of data collection, etc.
Sample error: errors due to sampling the population
Model error: error due to misfit of the selected model class
Algorithmic error: error due to the inability of the algorithm to find the correct solution

Why not select the model that returns the least error? A balance is needed between the right model structure and the right parameters when using a single measure. Fitting the data does not mean you fit the underlying function. There is a trade-off between flexibility and model performance.

Data split:

2/3 training set and 1/3 testing set; the two sets should be non-overlapping.
Training set: used to determine model parameters
Testing set: used to estimate model performance

Cross Validation:
- Split the data into multiple, non-overlapping subsets
- Use multiple estimations instead of a single estimate

K-fold cross-validation (e.g. k=10): Divide the data into k sets. Use one set for testing and the remainder for training, so that each set is used for testing once. The average of the k model errors is taken as the model error. The cross-validation may be repeated 10 times to reduce variance.

Leave-One-Out cross validation: Involves using a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data.

How to choose one data mining method over another one?
- Estimate cross-validation results with the first method.
- Estimate cross-validation results with the second method; make sure the same cross-validation data sets are used.
- Apply a paired Student's t-test to determine whether the population means differ significantly.

Occam's razor: The best theory is the smallest one that describes all the facts.

Association Rule: Given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently handled together. Association rules model this information by means of IF-THEN rules. The goal is to identify rules that indicate a strong dependence between antecedent and consequent; this strength is measured by the confidence of the rule.

Cardinality: a measure of the number of elements of the set. Cardinality = 1: IF white THEN red; cardinality = 2: IF white AND red THEN green.
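The confidence of an IF-THEN rule can be made concrete with a short sketch. It assumes the usual definition: the number of transactions that contain both antecedent and consequent, divided by the number of transactions that contain the antecedent. The function names `support_count` and `confidence` and the five transactions are invented for illustration; they are not the example database of the lecture.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule IF antecedent THEN consequent (assumed standard definition)."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical toy database of five transactions.
transactions = [
    {"red", "white"},
    {"white"},
    {"red", "white", "green"},
    {"red"},
    {"white", "green"},
]

# Rule with antecedent cardinality 1: IF white THEN red.
print(confidence(transactions, {"white"}, {"red"}))  # prints 0.5

# Rule with antecedent cardinality 2: IF white AND red THEN green.
print(confidence(transactions, {"white", "red"}, {"green"}))  # prints 0.5
```

In this toy database four transactions contain white and two of those also contain red, which gives the confidence of 0.5 for the first rule.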

Naïve algorithm association rule: Generates all the rules that would be candidates for indicating association between items, i.e. all possible combinations of items in a database with p distinct items (in our example p=6). The algorithm should find all combinations of single items, pairs of items, triplets of items and so on. In order to reduce the computational time, a good algorithm should generate only the combinations with higher frequency in the database: the frequent item sets.

Support of a rule: the number of transactions in a database that include both the antecedent and the consequent of the rule. Sometimes expressed as a percentage. For instance, the support for {red, white} is 4, or 100 x 4/10 = 40%.

Apriori algorithm: Initially generates frequent item sets with just one item. It then successively generates two-item frequent item sets, three-item frequent item sets and so on, discarding item sets which have a support below a desired minimum support. In general, generating k-item sets uses the frequent (k-1)-item sets.

Figure 4: apriori algorithm

Lecture 3

(Artificial + Biological) Neural Network: Highly parallel models of the brain and nervous system that process information much more like the brain than a serial computer. These models are adaptive systems and can change their structure during a learning phase. Artificial neural networks are suitable for classification and prediction since they are good at extracting and recognizing patterns (the style) and at generalizing from the already seen to make predictions.

Biological Neural Nets: (pigeon as art expert) A neural net that uses neurons and synapses for its internal structure.
Artificial Neural Nets: Uses nodes and weights for its internal structure.

Feeding data through the net:

Each node computes a weighted sum of the inputs and applies a certain function to it. For a set of input values x1, x2, ..., xp, the output value at node j is $g(\theta_j + \sum_{i=1}^{p} w_{ij} x_i)$. θj is called the bias of node j; it is a constant value that controls the level of contribution of node j. g is the so-called transfer function (e.g. a squash function).

Figure 5: feeding data through the net with squashing function

Normalizing data: Neural networks perform best when predictors and response variables are on a scale of [0,1].
Numerical variable X in the range [a, b]: rescaled to [0,1]
Categorical data: a choice of m fractions in [0,1], e.g. [0, 0.25, 0.5, 1]

Training the model: Estimate the values θj and wij that lead to the best predictive results. Compute the neural network output for each row in the training set. The model produces a prediction which is then compared with the actual response value. Their difference is the error for the output node. This error is used, iteratively, for estimating the weights and biases. One uses a hidden layer which includes nodes, weights and biases. What is visible are the input and output values.

Back propagation of error: A method which updates the weights and bias values based on the error of the output, starting at the output of the network and working backwards. The updating stops when the new weights are only incrementally different from those of the preceding iteration, when the misclassification rate reaches a required threshold, or when the limit on the number of runs is reached. Each updating iteration is named an epoch.
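The computation done by a single node, and the rescaling of a numerical predictor, can be sketched as follows. Two points are assumptions rather than statements from the notes: the squash function g is taken to be the logistic function, and a variable in the range [a, b] is rescaled linearly with (X - a) / (b - a). The function names and the example weights are hypothetical.

```python
import math


def logistic(s):
    """Logistic squash function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def node_output(inputs, weights, bias, g=logistic):
    """Output of node j: g(theta_j + sum_i w_ij * x_i)."""
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))


def rescale(x, a, b):
    """Linearly map a value from the range [a, b] onto [0, 1] (assumed normalization)."""
    return (x - a) / (b - a)


# Hypothetical node with two inputs; the predictors are first rescaled to [0, 1].
x1 = rescale(10.0, 0.0, 10.0)   # 1.0
x2 = rescale(0.0, 0.0, 4.0)     # 0.0
print(node_output([x1, x2], weights=[0.5, -0.3], bias=-0.5))  # prints 0.5
```

Here the weighted sum is -0.5 + 0.5 * 1.0 + (-0.3) * 0.0 = 0, and the logistic function maps 0 to 0.5. Training adjusts `weights` and `bias` so that outputs like this one move closer to the actual response values.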

The error update uses the output from output node k and a learning rate l. In the worked example the weights are updated using a learning rate of 0.5, after which the error is propagated backwards through the network.

Lecture 4

Classification tree: A predictive model which maps observations about an item to conclusions about the item's target value. The goal is to create a model that predicts the value of a target variable based on several input variables, by presenting a graphical view of classification rules. Classification trees have a double level of simplicity: they are simple both for the analyst and for the customers. The square terminal nodes are marked with 0 (non-acceptor) or 1 (acceptor). Each circle node represents a decision on a given predictor. Each path in the tree can be translated into a rule, for instance: IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0

Recursive Partitioning: For a given variable xi, a value si is chosen to split the p-dimensional space in two parts.

One part contains all the points with xi <= si and the other part contains the points with xi > si. The process continues until pure rectangles are obtained, i.e. rectangles that contain only points belonging to a single class. The ideal splitting value should reduce the impurity (heterogeneity) of the resulting rectangles. The impurity of a rectangle can be measured in several ways, one of which is the entropy measure.

Pruning: A common strategy is to let the tree grow until each node contains a small number of instances, and then to use pruning to remove the nodes that do not provide additional information.

Clustering: A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The results are used to get insight into the data distribution or as a preprocessing step for other algorithms. Clustering has a very broad application base: it can handle real-valued attributes, binary attributes, nominal (categorical) attributes, ordinal/ranked attributes or variables of mixed types. If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space. If all d dimensions are binary, then we can think of each data point as a binary vector.

Clustering within cluster:
- K-Means: Given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that the total squared distance of the points to their cluster centers, $\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2$, is minimized. One way of solving the k-means problem:
  1. Randomly pick k cluster centers {c1, c2, ..., ck}.
  2. For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i.
  3. For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci).
  4. Repeat steps 2 and 3 until convergence.
- K-Median:

Given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that the sum of the distances of the points to their cluster centers is minimized.
- K-Center/Partitioning around medoids: Choose randomly k medoids from the original dataset X. Assign each of the n-k remaining points to its closest medoid. Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost.

Distance function for binary vectors:
Jaccard similarity: $JSim(X,Y) = |X \cap Y| / |X \cup Y|$
Jaccard distance: $1 - JSim(X,Y)$

Distance functions for real-valued vectors:
Lp norm, with p a positive integer: $L_p(x,y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
If p=1, then L1 is the Manhattan distance: $L_1(x,y) = \sum_{i=1}^{d} |x_i - y_i|$
If p=2, then L2 is the Euclidean distance: $L_2(x,y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

Outliers: Objects that do not belong to any cluster or that form clusters of very small cardinality.

Hierarchical clustering: Produces a set of nested clusters organized as a hierarchical tree. Hierarchical clustering can be visualized as a dendrogram. No assumptions are made on the number of clusters: any number can be achieved by cutting the dendrogram at the proper level. There are two types of hierarchical clustering: agglomerative or divisive.

Agglomerative hierarchical clustering:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains

Dendrogram:

A tree-like diagram that records the sequences of merges or splits.

Figure 6: dendrogram

Distances:
Single link distance between clusters Ci and Cj: The minimum distance between any object in Ci and any object in Cj. Drawback: sensitive to noise, produces long clusters.
Complete link distance between clusters Ci and Cj: The maximum distance between any object in Ci and any object in Cj. Drawback: tends to break large clusters; all clusters tend to have the same diameter at first.
Group average distance between clusters Ci and Cj: The average distance between any object in Ci and any object in Cj. Is less susceptible to noise and outliers. Limitation: biased towards globular clusters.
Centroid distance between clusters Ci and Cj: The distance between the centroid ri of Ci and the centroid rj of Cj.
Ward's distance between clusters Ci and Cj: The difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters into cluster Cij. Is less susceptible to noise and outliers, but biased towards globular clusters.

Divisive hierarchical clustering:
- Start with a single cluster composed of all data points
- Split this into components
- Continue recursively
- Monothetic divisive methods split clusters using one variable/dimension at a time
- Polythetic divisive methods make splits on the basis of all variables together
- Any intercluster distance measure can be used
- Drawback: computationally intensive, less widely used than agglomerative methods

Expectation Maximization Algorithm:

Initialize k distribution parameters (θ1, ..., θk); each distribution parameter corresponds to a cluster center. Iterate between two steps:
Expectation step: (probabilistically) assign points to clusters.
Maximization step: estimate the model parameters that maximize the likelihood for the given assignment of points.

Process mining part

Lecture 5

Data mining (sometimes called data or knowledge discovery): the process of analyzing data from different perspectives and summarizing it into useful information, i.e. information that can be used to increase revenue, cut costs, or both.

Process Mining: the extraction of non-trivial information from a registration of what happens during the execution of a process (a so-called event log). It gives information about what really happens within a process, not what the owners/managers think happens within the process (objective vs. subjective information). It can be applied for performance analysis, auditing/security, organizational models and process models.

Issues that may hamper the application of process mining:
- Event data is of low / poor / bad quality.
- Not everything is logged: some process steps are not recorded, or time information is missing or too rough.
- A lot of effort is required to get / prepare the data.

Assumptions about event logs:
- A process consists of cases.
- A case consists of events such that each event relates to precisely one case.
- Events within a case are ordered.
- Events can have attributes. Examples of typical attribute names are activity, time, costs, and resource.

Flattening reality into event logs: In order to perform process mining, events need to be related to cases. However, in real life processes are not flat (see the order example in Section 4.4 of the book). In a hospital context

the case can be the treatments of a patient, the tasks of a nurse, etc. In a real process the flattened sub processes are intertwined.

Alpha algorithm:
Direct succession: x>y iff for some case x is directly followed by y.
Causality: x→y iff x>y and not y>x.
Parallel: x||y iff x>y and y>x.
Choice: x#y iff not x>y and not y>x.

Ti = start activities, To = end activities, Tw = all activities in the log. Yw determines the places Pw, and Fw contains the arrows (arcs).

Non-free-choice constructs are not captured, since only direct following relations are used in the alpha miner. Challenges include: hidden tasks, duplicate tasks, non-free-choice constructs, loops, mining and exploiting time, dealing with noise, dealing with incompleteness.

General Mining issues:
- Quality of the (learning) data
- Generalization: in data mining, not only a correct classification of the cases in the learning material is needed, but also a correct classification of new cases. In process mining an unlimited number of traces can be possible, so generalization is difficult.
- Overfitting
- Noise
- Model representation bias
- Search technique bias

Quality of mined models:
- Parameter optimization using k-fold cross validation + t-test
- Classification: % of correctly classified cases
- Estimation: Mean Squared Error = $\frac{1}{n} \sum_{i=1}^{n} (t_i - x_i)^2$
- Performance is always measured on NEW material

Quality of the mined model:
Fitness: is the observed behavior captured by the model?

Precision: does the model only allow for behavior that happens in reality?
Generalization: does the model allow for more behavior than encountered in reality?
Structure/Simplicity: does the model have a minimal structure to describe the behavior (easy to understand models)?

Quality of Data Mining model:
- One measure
- Measures independent from mining technique
- Benchmark datasets
- Parameter optimization

Quality of Process Mining model:
- Combination of 3 measures (generalization is missing)
- Strongly connected with Petri net formalism (theoretical strong, practical )
- Many event logs but not a clear benchmark set
- Parameter optimization unclear

Lecture 6

Sometimes one has to model very extensive and big processes, and the resulting models will look like spaghetti. In situations with low-structured domains (for instance health care), noise and low-frequency behavior, one has to use different heuristics that are more flexible, focus on the main behavior and take the frequency of the behavior in the event log into account. Examples are the Flexible Heuristics Miner, the Genetic Miner and the Fuzzy Miner.

Flexible Heuristic miner:

Five different types of noise-generating operations: (i) delete the head of a trace, (ii) delete the tail of a trace, (iii) delete a part of the body, (iv) remove one event, and finally (v) interchange two randomly chosen events.

Fuzzy miner: Roadmap principle in process mining:
Emphasis: Significant information highlighted by visual means
Customization: Level of detail adapted to purpose
Abstraction: Low-level information omitted
Aggregation: Clusters of low-level detail information
Uses internal filters: concurrency filter, edge filter, node filter

Lecture 7

CRISP-DM (Cross-Industry Standard Process for Data Mining):
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Data mining overview:
- Goal is the mining of a prediction model
- The model is employed in the business process

Process mining overview:
- Goal is a better understanding of the business process
- The knowledge is used to improve the process.
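The five noise-generating operations listed under Lecture 6 can be made concrete on a single trace. The sketch below is only an illustration: a trace is assumed to be a Python list of activity names, the function names are hypothetical, and the positions are passed in explicitly to keep the example deterministic, whereas in an experiment they would be drawn at random.

```python
def delete_head(trace, n):
    """(i) Delete the first n events of the trace."""
    return trace[n:]


def delete_tail(trace, n):
    """(ii) Delete the last n events of the trace."""
    return trace[:len(trace) - n]


def delete_body_part(trace, start, n):
    """(iii) Delete n consecutive events starting at position start."""
    return trace[:start] + trace[start + n:]


def remove_event(trace, i):
    """(iv) Remove the single event at position i."""
    return trace[:i] + trace[i + 1:]


def interchange(trace, i, j):
    """(v) Interchange the events at positions i and j."""
    noisy = list(trace)
    noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


# Hypothetical trace of one case.
trace = ["a", "b", "c", "d", "e"]

print(delete_head(trace, 2))          # prints ['c', 'd', 'e']
print(delete_tail(trace, 2))          # prints ['a', 'b', 'c']
print(delete_body_part(trace, 1, 2))  # prints ['a', 'd', 'e']
print(remove_event(trace, 2))         # prints ['a', 'b', 'd', 'e']
print(interchange(trace, 0, 4))       # prints ['e', 'b', 'c', 'd', 'a']
```

Each function returns a new list and leaves the original trace unchanged, so several kinds of noise can be applied to copies of the same log.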