Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen


Content
Data Mining part: Lecture 1, Lecture 2, Lecture 3, Lecture 4
Process mining part: Lecture 5, Lecture 6, Lecture 7
Data Mining part

Lecture 1

Data mining: the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. It lies at the intersection of artificial intelligence, machine learning and database systems.

Data mining processes:
1. Classification: a method to identify to which of a set of categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
2. Prediction/estimation: prediction methods are very similar to classification, but they try to predict the value of a numerical variable rather than a class.
3. Association rules: a data mining method for discovering relevant relations among variables contained in large databases.
4. Clustering: a grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Supervised method: uses data sets in which the value of the outcome of interest is known (classification & prediction).
Unsupervised method: uses data sets in which the value of the outcome of interest is unknown (association rules & clustering).

Data types:
Training data: this set contains the data from which a classification or prediction algorithm learns about the relationships between the input variables and the outcome variables.
Validation data: once the algorithm has learned from the training data, it is applied to this sample of data (where the outcome is known) to see how well it does in comparison to other models.
Test data: if many data mining models are used, it is prudent to save a third sample of data with known outcomes, to be used with the finally selected model to predict how well it will do.
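The three data sets described above can be illustrated with a minimal Python sketch. The 60/20/20 proportions, the function name `three_way_split` and the fixed seed are illustrative assumptions, not values taken from the lecture.

```python
import random

def three_way_split(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and cut them into non-overlapping
    training, validation and test sets (fractions are assumptions)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test

train, valid, test = three_way_split(range(10))
print(len(train), len(valid), len(test))
```

Models would be fitted on `train`, compared on `valid`, and only the finally selected model would be scored on `test`.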
Knowledge discovery: when one does not know that the information is there, but has the means to analyze the data (importance of attributes is unclear, too much data, polluted data, results make no sense).

K-nearest neighbors classification method: identifies k observations in the training set that are similar to a new record that we wish to classify. The algorithm uses these similar records to classify the new record into a class: it assigns the new record to the predominant class among these neighbors. If (x1, x2, ..., xp) are the predictor values of the new record to classify, the algorithm looks for the records in the training data that are most similar (nearest) to (x1, x2, ..., xp) in terms of the Euclidean distance. The value of k determines how many of the nearest records are selected as neighbors. If k is too small, the approach can be too sensitive to noise. If k is too large, the approach can put the record to be classified in a wrong class.
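A minimal sketch of the k-nearest neighbors rule described above, in plain Python. The function name `knn_classify` and the tiny training set are hypothetical; ties between classes are broken arbitrarily by `Counter`.

```python
import math
from collections import Counter

def knn_classify(training, new_record, k=3):
    """training: list of (predictor_tuple, class_label) pairs.
    Returns the predominant class among the k nearest records."""
    # Sort the training records by Euclidean distance to the new record.
    by_distance = sorted(training, key=lambda row: math.dist(row[0], new_record))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: three records of class A, two of class B.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]

label = knn_classify(training, (1.1, 1.0), k=3)
print(label)
```

With this data, a record at (5.1, 5.0) is classified as B for k=1 but as A for k=5, because with k equal to the size of the training set the majority class always wins. This is the "k too large" effect mentioned above.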
Naïve rule classification method: classifies a new record as a member of the majority class in the current training set, without taking any information from the input variables (predictors) into account. It is used as a baseline for evaluating the performance of more complicated classifiers.

Bayes classification method: classifies on the basis of the computed probability that a record belongs to a given class, using not only the prevalence of that class but also the additional information available on that record. It works only with predictors that are categorical (not numerical). It is based on the concept of conditional probability:
$P(H \mid E) = P(H \cap E) / P(E)$
and on Bayes' theorem:
$P(H \mid E) = P(E \mid H) \, P(H) / P(E)$
where H is a hypothesis to be tested and E is the evidence associated with the hypothesis. From the classification point of view, H is the predicted class and E represents the values of the input variables (predictors). $P(H \mid E)$ is the conditional probability that H is true given evidence E. $P(H)$ is the so-called a priori probability: the probability of the hypothesis before any evidence is presented.

A significant problem with Bayes classifiers arises when one of the counts for an attribute value is zero. Instead of the ratio $n/d$ one then computes $(n + k \cdot p) / (d + k)$, with k a value between 0 and 1 (usually 1) and p chosen as an equal fractional part of the total number of possible values for the attribute. If the attribute can assume two values, then p = 0.5. Another problem is missing data; a missing value is simply left out of the computation.

Example: determine the sex based on magazine promotion = yes, watch promotion = yes, life insurance = no, credit card insurance = no. To determine the class, first build a pivot table and then calculate the probabilities.
Figure 1: given information
Figure 2: pivot table
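The Bayes classification steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the six records are hypothetical (they are not the data of Figure 1), the zero-count correction is assumed to have the form (n + k·p)/(d + k), and the function names `smoothed` and `naive_bayes_scores` are made up for this example.

```python
def smoothed(n, d, k=1.0, p=0.5):
    """Assumed zero-count correction: (n + k*p) / (d + k) instead of n / d."""
    return (n + k * p) / (d + k)

def naive_bayes_scores(records, evidence, class_key, k=1.0, p=0.5):
    """Return P(class | evidence) for each class, normalised to sum to 1."""
    classes = sorted({r[class_key] for r in records})
    scores = {}
    for c in classes:
        members = [r for r in records if r[class_key] == c]
        score = len(members) / len(records)      # a priori probability P(H)
        for attr, value in evidence.items():     # multiply by each P(E_i | H)
            n = sum(1 for r in members if r[attr] == value)
            score *= smoothed(n, len(members), k, p)
        scores[c] = score
    total = sum(scores.values())                 # stands in for P(E)
    return {c: s / total for c, s in scores.items()}

# Hypothetical records, NOT the data behind Figure 1.
records = [
    {"magazine": "yes", "watch": "no",  "sex": "male"},
    {"magazine": "yes", "watch": "yes", "sex": "female"},
    {"magazine": "no",  "watch": "no",  "sex": "male"},
    {"magazine": "yes", "watch": "yes", "sex": "male"},
    {"magazine": "yes", "watch": "no",  "sex": "female"},
    {"magazine": "yes", "watch": "yes", "sex": "female"},
]
posterior = naive_bayes_scores(records, {"magazine": "yes", "watch": "yes"}, "sex")
print(posterior)
```

The pivot table of the example corresponds to the counts `n` computed inside the loop: one count per class and attribute value.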
Lecture 2

Figure 3: industry standard data mining framework

A model is built up from:
Structure: variables, inputs, outputs and the types of relations amongst them
Parameters: the free variables that remain after a structure is selected
Search method: the method with which the optimal parameters are identified

Scoring functions:
Scoring function How?
Sum of squared errors
Mean absolute error
Classification confusion matrix
Estimated misclassification rate: err
Accuracy: accuracy = 1 - err
Classification in unequal classes:
sensitivity: ability of a classifier to classify the members of C0 correctly
specificity: ability of a classifier to rule out the members of C1 correctly
false positive rate: percentage of C1 members erroneously classified in C0
false negative rate: percentage of C0 members erroneously classified in C1
Mean squared error
Root mean squared error
Variance accounted for

Origin of errors:
Experimental error: inherent to the data, due to noise, the method of data collection, etc.
Sample error: error due to sampling the population
Model error: error due to misfit of the selected model class
Algorithmic error: error due to the inability of the algorithm to find the correct solution

Why not select the model that returns the least error? A balance is needed between the right model structure and the right parameters when using a single measure. Fitting the data does not mean that you fit the underlying function. There is a trade-off between flexibility and model performance.

Data split:
2/3 training set & 1/3 testing set; the two sets should be non-overlapping.
Training set: used to determine the model parameters
Testing set: used to estimate the model performance

Cross-validation: split the data into multiple, non-overlapping subsets and use multiple estimates instead of a single estimate.
K-fold cross-validation (e.g. k = 10): divide the data into k sets. Use one set for testing and the remainder for training. The average of the k model errors is taken as the model error. The cross-validation may be repeated 10 times to reduce the variance.
Leave-one-out cross-validation: uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data.

How to choose one data mining method over another?
Estimate the cross-validation results with the first method.
Estimate the cross-validation results with the second method; make sure the same cross-validation data sets are used.
Apply a paired Student's t-test to determine whether the population means differ significantly.

Occam's razor: the best theory is the smallest one that describes all the facts.

Association rule: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently handled together. Association rules model this information by means of IF-THEN rules. The goal is to identify rules that indicate a strong dependence between antecedent and consequent, measured by the confidence:
Confidence = (number of transactions containing both the antecedent and the consequent) / (number of transactions containing the antecedent)
Cardinality: a measure of the number of elements of the set. Cardinality = 1: IF white THEN red; cardinality = 2: IF white AND red THEN green.
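As a small illustration of the confidence measure, the sketch below counts transactions in a hypothetical five-transaction database. The transactions and the helper names `count_containing` and `confidence` are assumptions made for this example.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule IF antecedent THEN consequent."""
    both = count_containing(transactions, antecedent | consequent)
    ante = count_containing(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"white", "blue"},
    {"red", "white", "orange"},
    {"red", "blue"},
]

# Rule with antecedent cardinality 1: IF white THEN red.
conf_white_red = confidence(transactions, {"white"}, {"red"})
# Rule with antecedent cardinality 2: IF white AND red THEN green.
conf_wr_green = confidence(transactions, {"white", "red"}, {"green"})
print(conf_white_red, conf_wr_green)
```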
Naïve algorithm for association rules: generates all the rules that would be candidates for indicating an association between items, i.e. all possible combinations of items in a database with p distinct elements (in our example p = 6). The algorithm has to find all combinations of single items, pairs of items, triplets of items and so on. In order to reduce the computation time, a good algorithm should generate only the combinations with a high frequency in the database: the frequent item sets.

Support of a rule: the number of transactions in a database that include both the antecedent and the consequent of the rule. It is sometimes expressed as a percentage. For instance, the support for {red, white} is 4, or 100 x 4/10 = 40%.

Apriori algorithm: initially generates the frequent item sets with just one item. It then successively generates the two-item frequent item sets, the three-item frequent item sets and so on, discarding item sets that have a support below a desired minimum support. In general, generating the k-item sets uses the frequent (k-1)-item sets.
Figure 4: apriori algorithm

Lecture 3

(Artificial + biological) neural network: highly parallel models of the brain and nervous system that process information much more like the brain than like a serial computer. These models are adaptive systems and can change their structure during a learning phase. Artificial neural networks are suitable for classification and prediction, since they are good at extracting and recognizing patterns (the style) and at generalizing from what has already been seen in order to make predictions.
Biological neural nets ("Pigeon as art expert"): neural nets that use neurons and synapses for their internal structure.
Artificial neural nets: use nodes and weights for their internal structure.

Feeding data through the net:
Each node computes a weighted sum of its inputs and applies a certain function to it. For a set of input values x1, x2, ..., xp, the output value at node j is $g(\theta_j + \sum_{i=1}^{p} w_{ij} x_i)$. $\theta_j$ is called the bias of node j; it is a constant value that controls the level of contribution of node j. g is the so-called transfer function (e.g. a squash function).
Figure 5: feeding data through the net with squashing function

Normalizing data: neural networks perform best when the predictors and response variables are on a scale of [0, 1].
Numerical variable X in the range [a, b]: $X_{norm} = (X - a) / (b - a)$
Categorical data: a choice of m fractions in [0, 1], e.g. [0, 0.25, 0.5, 1]

Training the model: estimate the values $\theta_j$ and $w_{ij}$ that lead to the best predictive results. Compute the neural network output for each row in the training set. The model produces a prediction, which is then compared with the actual response value. Their difference is the error for the output node. This error is used, iteratively, for estimating the weights and biases. One uses a hidden layer, which includes nodes, weights and biases. What is visible are the input and output values.

Back propagation of error: a method that updates the weights and bias values based on the error of the output, starting with the last output of the network. The updating stops when the new weights are only incrementally different from those of the preceding iteration, when the misclassification rate reaches a required threshold, or when the limit on the number of runs is reached. Each updating iteration is called an epoch.
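The node computation and the [0, 1] normalization described above can be sketched as follows. The logistic function is used here as one possible squash function, and all weights, biases and input values are hypothetical numbers chosen for illustration; the function names are likewise assumptions.

```python
import math

def logistic(s):
    """One common squash function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, bias):
    """g(theta_j + sum_i w_ij * x_i) for a single node j."""
    return logistic(bias + sum(w * x for w, x in zip(weights, inputs)))

def min_max(x, a, b):
    """Rescale a numerical value from the range [a, b] to [0, 1]."""
    return (x - a) / (b - a)

def forward(inputs, hidden_layer, output_node):
    """Feed the inputs through one hidden layer and one output node.
    hidden_layer: list of (weights, bias) pairs; output_node: (weights, bias)."""
    hidden = [node_output(inputs, w, theta) for w, theta in hidden_layer]
    w_out, theta_out = output_node
    return node_output(hidden, w_out, theta_out)

# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output node.
hidden_layer = [([0.05, 0.01], -0.3), ([-0.01, 0.03], 0.2)]
output_node = ([0.01, 0.05], -0.015)
x = [min_max(5.0, 0.0, 10.0), min_max(8.0, 0.0, 10.0)]
y_hat = forward(x, hidden_layer, output_node)
print(y_hat)
```

Training would compare `y_hat` with the actual response value and adjust the weights and biases over successive epochs; that update step is not shown here.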
In the back-propagation update, the error depends on the output from output node k and on the learning rate l. With a learning rate of 0.5 the weights can be updated, after which the error is propagated back through the network.

Lecture 4

Classification tree: a predictive model that maps observations about an item to conclusions about the item's target value. The goal is to create a model that predicts the value of a target variable based on several input variables, by giving a graphical view of classification rules. Classification trees are simple on two levels: for the analyst and for the customers.
The square terminal nodes are marked with 0 (non-acceptor) or 1 (acceptor). Each circle node represents a decision on a given predictor. Each path in the tree can simply be translated into a rule, for instance:
IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0

Recursive partitioning: for a given variable xi, a value si is chosen to split the p-dimensional space into two parts.
One part contains all the points with xi <= si and the other part contains the points with xi > si. The process continues until pure rectangles are obtained, i.e. rectangles that contain only points belonging to a single class. The ideal splitting value should reduce the impurity (heterogeneity) of the resulting rectangles. The impurity of a rectangle can be measured with an impurity index or with the entropy measure, $\text{entropy}(A) = -\sum_{k=1}^{m} p_k \log_2 p_k$, where $p_k$ is the proportion of points in rectangle A that belong to class k (out of m classes).

Pruning: a common strategy is to let the tree grow until each node contains a small number of instances, and then to use pruning to remove the nodes that do not provide additional information.

Clustering: a grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The results are used to get insight into the data distribution or as a preprocessing step for other algorithms. Clustering can be applied to many kinds of data, such as real-valued attributes, binary attributes, nominal (categorical) attributes, ordinal/ranked attributes or variables of mixed types. If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space. If all d dimensions are binary, then we can think of each data point as a binary vector.

Clustering within cluster:
K-means: given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that $\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2$ is minimized.
One way of solving the k-means problem: randomly pick k cluster centers {c1, c2, ..., ck}. For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i. For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci). Repeat until convergence.
K-median:
given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that $\sum_{i=1}^{k} \sum_{x \in C_i} L_1(x, c_i)$ is minimized.
K-center / partitioning around medoids: randomly choose k medoids from the original data set X. Assign each of the n - k remaining points to its closest medoid. Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost.

Distance function for binary vectors:
Jaccard similarity: $JSim(X, Y) = |X \cap Y| / |X \cup Y|$
Jaccard distance: $JDist(X, Y) = 1 - JSim(X, Y)$

Distance functions for real-valued vectors:
$L_p$ norm, with p a positive integer: $L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
If p = 1, then $L_1$ is the Manhattan distance: $L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
If p = 2, then $L_2$ is the Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

Outliers: objects that do not belong to any cluster or that form clusters of very small cardinality.

Hierarchical clustering: produces a set of nested clusters organized as a hierarchical tree. Hierarchical clustering can be visualized as a dendrogram. No assumptions are made about the number of clusters: any number can be obtained by cutting the dendrogram at the proper level. There are two types of hierarchical clustering: agglomerative or divisive.

Agglomerative hierarchical clustering:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains

Dendrogram:
A tree-like diagram that records the sequences of merges or splits.
Figure 6: dendrogram

Distances:
Single link distance between clusters Ci and Cj: the minimum distance between any object in Ci and any object in Cj. Drawback: sensitive to noise, produces long clusters.
Complete link distance between clusters Ci and Cj: the maximum distance between any object in Ci and any object in Cj. Drawback: tends to break large clusters; all clusters tend to have the same diameter at first.
Group average distance between clusters Ci and Cj: the average distance between any object in Ci and any object in Cj. It is less susceptible to noise and outliers, but biased towards globular clusters.
Centroid distance between clusters Ci and Cj: the distance between the centroid ri of Ci and the centroid rj of Cj.
Ward's distance between clusters Ci and Cj: the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares that results from merging the two clusters into cluster Cij. It is less susceptible to noise and outliers, but biased towards globular clusters.

Divisive hierarchical clustering:
Start with a single cluster composed of all data points
Split this into components
Continue recursively
Monothetic divisive methods split clusters using one variable/dimension at a time
Polythetic divisive methods make splits on the basis of all variables together
Any inter-cluster distance measure can be used
Drawback: computationally intensive, less widely used than agglomerative methods

Expectation Maximization Algorithm:
Initialize k distribution parameters (θ1, ..., θk); each distribution parameter corresponds to a cluster center. Iterate between two steps:
Expectation step: (probabilistically) assign points to clusters.
Maximization step: estimate the model parameters that maximize the likelihood for the given assignment of points.

Process mining part

Lecture 5

Data mining (sometimes called data or knowledge discovery): the process of analyzing data from different perspectives and summarizing it into useful information, information that can be used to increase revenue, cut costs, or both.

Process mining: the extraction of non-trivial information from a record of what happens during the execution of a process (a so-called event log). It gives information about what really happens within a process, not what the owners/managers think happens within the process (objective vs. subjective information). It can be applied for performance analysis, auditing/security, organizational models and process models.

Issues that may hamper the application of process mining:
Event data is of low / poor / bad quality.
Not everything is logged: some process steps are not recorded, or time information is missing or too coarse.
A lot of effort is required to get / prepare the data.

Assumptions about event logs:
A process consists of cases.
A case consists of events such that each event relates to precisely one case.
Events within a case are ordered.
Events can have attributes. Examples of typical attribute names are activity, time, costs, and resource.

Flattening reality into event logs: in order to perform process mining, events need to be related to cases. However, in real life processes are not flat (see the order example in Section 4.4 of the book). In a hospital context
the case can be the treatments of a patient, the tasks of a nurse, etc. In a real process the flattened subprocesses are intertwined.

Alpha algorithm:
Direct succession: x > y iff for some case x is directly followed by y.
Causality: x → y iff x > y and not y > x.
Parallel: x ∥ y iff x > y and y > x.
Choice: x # y iff not x > y and not y > x.
T_I = start, T_O = end, T_W = total; Y_W contains the places; F_W contains the arrows.
Non-free-choice constructs are not captured: only direct-succession relations are used in the alpha miner. Challenges include: hidden tasks, duplicate tasks, non-free-choice constructs, loops, mining and exploiting time, dealing with noise, dealing with incompleteness.

General mining issues:
Quality of the (learning) data
Generalization: in data mining, not only a correct classification of the cases in the learning material is needed, but also a correct classification of new cases. In process mining an unlimited number of traces can be possible, so generalization is difficult.
Overfitting
Noise
Model representation bias
Search technique bias

Quality of mined models:
Parameter optimization using k-fold cross-validation + t-test
Classification: % of correctly classified cases
Estimation: Mean Squared Error = $\frac{1}{n} \sum_{i=1}^{n} (t_i - x_i)^2$
The performance is always measured on NEW material.

Quality of the mined model:
Fitness: is the observed behavior captured by the model?
Precision: does the model only allow for behavior that happens in reality?
Generalization: does the model allow for more behavior than encountered in reality?
Structure/simplicity: does the model have a minimal structure to describe the behavior (easy-to-understand models)?

Quality of a data mining model:
One measure
Measures independent of the mining technique
Benchmark data sets
Parameter optimization

Quality of a process mining model:
Combination of 3 measures (generalization is missing)
Strongly connected with the Petri net formalism (theoretical strong, practical )
Many event logs, but no clear benchmark set
Parameter optimization unclear

Lecture 6

Sometimes one has to model very extensive and large processes, and the resulting models will look like spaghetti. In situations with low-structured domains (for instance health care), noise and low-frequency behavior, one has to use different heuristics that are more flexible, focus on the main behavior and take the frequency of the behavior in the event log into account. Examples are the Flexible Heuristics Miner, the Genetic Miner and the Fuzzy Miner.

Flexible Heuristics Miner:
Five different types of noise-generating operations: (i) delete the head of a trace, (ii) delete the tail of a trace, (iii) delete a part of the body, (iv) remove one event, and finally (v) interchange two randomly chosen events.

Fuzzy miner: roadmap principle in process mining:
Emphasis: significant information is highlighted by visual means
Customization: the level of detail is adapted to the purpose
Abstraction: low-level information is omitted
Aggregation: clusters of low-level detail information
Uses internal filters: concurrency filter, edge filter, node filter

Lecture 7

CRISP-DM: Cross-Industry Standard Process for Data Mining:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Data mining overview:
The goal is the mining of a prediction model.
The model is employed in the business process.

Process mining overview:
The goal is a better understanding of the business process.
The knowledge is used to improve the process.
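As an illustration of the ordering relations used by the Alpha algorithm in Lecture 5 (direct succession, causality, parallel, choice), the following Python sketch derives them from a small event log. The three traces, the function name `footprint` and the ASCII symbols "->", "<-", "||" and "#" are hypothetical choices made for this example; the sketch covers only the relations, not the construction of places and arrows.

```python
def footprint(log):
    """Derive the ordering relation for every pair of activities.
    log: list of traces, each trace a sequence of activity names."""
    activities = sorted({a for trace in log for a in trace})
    # Direct succession: x > y iff x is directly followed by y in some case.
    follows = {(x, y) for trace in log for x, y in zip(trace, trace[1:])}
    relation = {}
    for x in activities:
        for y in activities:
            xy, yx = (x, y) in follows, (y, x) in follows
            if xy and not yx:
                relation[(x, y)] = "->"   # causality
            elif yx and not xy:
                relation[(x, y)] = "<-"   # reverse causality
            elif xy and yx:
                relation[(x, y)] = "||"   # parallel
            else:
                relation[(x, y)] = "#"    # choice (never follow each other)
    return relation

# Hypothetical event log with three cases.
log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
relation = footprint(log)
print(relation[("a", "b")], relation[("b", "c")], relation[("b", "e")])
```

In this log, b and c occur in both orders and are therefore parallel, while b and e never follow each other directly and are in a choice relation.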
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationAn Introduction to Cluster Analysis for Data Mining
An Introduction to Cluster Analysis for Data Mining 10/02/2000 11:42 AM 1. INTRODUCTION... 4 1.1. Scope of This Paper... 4 1.2. What Cluster Analysis Is... 4 1.3. What Cluster Analysis Is Not... 5 2. OVERVIEW...
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationCluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009
Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative Kmeans Densitybased Interpretation
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. DensityBased Methods 6. GridBased Methods 7. ModelBased
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms Kmeans and its variants Hierarchical clustering
More informationOverview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.unisb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
More informationCHAPTER 3 DATA MINING AND CLUSTERING
CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge
More informationCluster Analysis: Basic Concepts and Algorithms
Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering Kmeans Intuition Algorithm Choosing initial centroids Bisecting Kmeans Postprocessing Strengths
More informationRobotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard
Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors ChiaHui Chang and ZhiKai Ding Department of Computer Science and Information Engineering, National Central University, ChungLi,
More information15.564 Information Technology I. Business Intelligence
15.564 Information Technology I Business Intelligence Outline Operational vs. Decision Support Systems What is Data Mining? Overview of Data Mining Techniques Overview of Data Mining Process Data Warehouses
More informationIntroduction to Statistical Machine Learning
CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Webbased Analytics Table
More informationClustering UE 141 Spring 2013
Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or
More informationClustering & Association
Clustering  Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects
More informationAn Introduction to Data Mining
An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
More informationProtein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks YoungRae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
More informationDiscovering process models from empirical data
Discovering process models from empirical data Laura Măruşter (l.maruster@tm.tue.nl), Ton Weijters (a.j.m.m.weijters@tm.tue.nl) and Wil van der Aalst (w.m.p.aalst@tm.tue.nl) Eindhoven University of Technology,
More informationW6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
More informationCluster Analysis: Basic Concepts and Algorithms
8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should
More informationPredicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationBIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 20150305
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 20150305 Roman Kern (KTI, TU Graz) Ensemble Methods 20150305 1 / 38 Outline 1 Introduction 2 Classification
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationData Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data
More informationRole of Neural network in data mining
Role of Neural network in data mining Chitranjanjit kaur Associate Prof Guru Nanak College, Sukhchainana Phagwara,(GNDU) Punjab, India Pooja kapoor Associate Prof Swami Sarvanand Group Of Institutes Dinanagar(PTU)
More informationUsing Trace Clustering for Configurable Process Discovery Explained by Event Log Data
Master of Business Information Systems, Department of Mathematics and Computer Science Using Trace Clustering for Configurable Process Discovery Explained by Event Log Data Master Thesis Author: ing. Y.P.J.M.
More informationData Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
More information10810 /02710 Computational Genomics. Clustering expression data
10810 /02710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intracluster similarity low intercluster similarity Informally,
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationD A T A M I N I N G C L A S S I F I C A T I O N
D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, MayJun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationClassifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical microclustering algorithm ClusteringBased SVM (CBSVM) Experimental
More informationIntroduction to Learning & Decision Trees
Artificial Intelligence: Representation and Problem Solving 538 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning?  more than just memorizing
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationData Mining and Neural Networks in Stata
Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di MilanoBicocca mario.lucchini@unimib.it maurizio.pisati@unimib.it
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationData Mining Project Report. Document Clustering. Meryem UzunPer
Data Mining Project Report Document Clustering Meryem UzunPer 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. Kmeans algorithm...
More informationFlat Clustering KMeans Algorithm
Flat Clustering KMeans Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but
More informationAutomatic Web Page Classification
Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory
More informationUnsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning
Unsupervised Learning and Data Mining Unsupervised Learning and Data Mining Clustering Decision trees Artificial neural nets Knearest neighbor Support vectors Linear regression Logistic regression...
More informationOutlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598. Keynote, Outlier Detection and Description Workshop, 2013
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
More informationLecture 6. Artificial Neural Networks
Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM ThanhNghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More informationDistances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
More informationImpelling Heart Attack Prediction System using Data Mining and Artificial Neural Network
General Article International Journal of Current Engineering and Technology EISSN 2277 4106, PISSN 23475161 2014 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Impelling
More informationData Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
More informationData Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004
More informationData Mining and Clustering Techniques
DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center
More informationClustering. 15381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv BarJoseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationKnowledge Discovery and Data Mining. Structured vs. NonStructured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. NonStructured Data Most business databases contain structured data consisting of welldefined fields with numeric or alphanumeric values.
More informationDATA MINING TECHNIQUES
DATA MINING TECHNIQUES Mohammed J. Zaki Department of Computer Science, Rensselaer Polytechnic Institute Troy, New York 121803590, USA Email: zaki@cs.rpi.edu Limsoon Wong Institute for Infocomm Research
More information