Summary Data Mining & Process Mining (1BM46). Made by S.P.T. Ariesen


Content
Data Mining part: Lecture 1, Lecture 2, Lecture 3, Lecture 4
Process mining part: Lecture 5, Lecture 6, Lecture 7

Data Mining part

Lecture 1

Data Mining: The process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. It lies at the intersection of artificial intelligence, machine learning and database systems.

Data mining processes:
1. Classification: A method to identify to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
2. Prediction/estimation: Prediction methods are very similar to classification, but they try to predict the value of a numerical variable rather than a class.
3. Association rules: A data mining method for discovering relevant relations among variables contained in large databases.
4. Clustering: A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Supervised method: Uses data sets in which the value of the outcome of interest is known (classification & prediction).
Unsupervised method: Uses data sets in which the value of the outcome of interest is unknown (association rules & clustering).

Data types:
Training data: This set contains the data from which a classification or prediction algorithm learns about the relationships between the input variables and the outcome variable.
Validation data: Once the algorithm has learned from the training data, it is applied to this sample of data (where the outcome is known) to see how well it does in comparison to other models.
Test data: If many data mining models are used, it is prudent to save a third sample of data with known outcomes, which is used with the finally selected model to estimate how well it will perform.
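The three data sets described above can be illustrated with a minimal sketch that assumes a simple random partition. The function name `three_way_split` and the 50/25/25 proportions are illustrative choices, not taken from the lecture.

```python
import random

def three_way_split(records, train_frac=0.5, valid_frac=0.25, seed=0):
    """Shuffle the records and cut them into training, validation and test parts.

    The fractions are hypothetical defaults; whatever is left after the
    training and validation parts becomes the test part.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = three_way_split(range(20))
print(len(train), len(valid), len(test))
```

The three parts do not overlap, so the test part is only touched once, by the finally selected model.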
Knowledge discovery: When one does not know whether the information is there, but has the means to analyze the data (the importance of attributes is unclear, too much data, polluted data, results make no sense).

K-Nearest neighbors classification method: Identifies the k observations in the training set that are most similar to a new record that we wish to classify, and uses these similar records to classify the new record: it assigns the new record to the predominant class among these neighbors. If (x1, x2, ..., xp) are the predictor values of the new record, the algorithm looks for the records in the training data that are most similar (nearest) to (x1, x2, ..., xp) in terms of the Euclidean distance. The value of k determines how many nearest neighbors are selected. If k is too small, the approach can be too sensitive to noise. If k is too large, the neighborhood can contain many records of other classes, so the record can be put in the wrong class.
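The k-nearest-neighbors procedure described above can be sketched as follows. This is a minimal illustration, assuming numerical predictors and a plain majority vote; the names `euclidean` and `knn_classify` and the six training records are made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equally long predictor tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, new_record, k=3):
    """Assign new_record to the predominant class among its k nearest neighbors.

    training is a list of (predictor_tuple, class_label) pairs.
    """
    neighbors = sorted(training, key=lambda row: euclidean(row[0], new_record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

print(knn_classify(training, (1.1, 1.0), k=3))
```

With k equal to the size of the training set, every record would simply receive the majority class, which is the behavior the text warns about for large k.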

Naïve Rule classification method: Classifies a new record as a member of the majority class in the current training set, without taking information from the input variables (predictors) into account. It is used as a baseline for evaluating the performance of more complicated classifiers.

Bayes Classification method: Classifies based on the computed probability that a record belongs to a given class, using not only the prevalence of that class but also additional information on that record. It works only with predictors that are categorical (not numerical). It is based on the concept of conditional probability:

$P(H \mid E) = \frac{P(H \cap E)}{P(E)}$

and on Bayes' theorem:

$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$

where H is a hypothesis to be tested and E is the evidence associated with the hypothesis. From the classification point of view, H is the predicted class and E represents the values of the input variables (predictors). P(H | E) is the conditional probability that H is true given evidence E. P(H) is the so-called a priori probability: the probability of the hypothesis before any evidence is presented.

A significant problem with Bayes classifiers arises when one of the counts for an attribute value is zero. Instead of the ratio $n/d$ one then computes $\frac{n + k\,p}{d + k}$, with k a value between 0 and 1 (usually 1) and p chosen as an equal fractional part of the total number of possible values for the attribute. If the attribute can assume two values, then p = 0.5. Another problem is missing data: a missing value is simply not considered in the computation.

Example: determine the sex based on magazine promotion = yes, watch promotion = yes, life insurance = no, credit card insurance = no. To determine the class, first build a pivot table and then calculate the probabilities.

Figure 1: given information
Figure 2: pivot table
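Since the data of Figures 1 and 2 is not reproduced here, the following sketch uses a small made-up data set to show how the class scores P(E | H) P(H) could be computed with the zero-count correction. It assumes that the predictors are treated as independent given the class (the usual "naive" simplification); the names `smoothed_ratio` and `bayes_scores` and all records are hypothetical.

```python
from collections import Counter

def smoothed_ratio(n, d, k=1.0, p=0.5):
    """(n + k*p) / (d + k): stays above zero even when the count n is 0."""
    return (n + k * p) / (d + k)

def bayes_scores(rows, labels, evidence, k=1.0, p=0.5):
    """Return P(E|H) * P(H) per class H, assuming independent predictors."""
    class_counts = Counter(labels)
    scores = {}
    for cls, d in class_counts.items():
        score = d / len(labels)                    # a priori probability P(H)
        for attr, value in evidence.items():
            n = sum(1 for row, lab in zip(rows, labels)
                    if lab == cls and row[attr] == value)
            score *= smoothed_ratio(n, d, k, p)    # smoothed P(attr = value | H)
        scores[cls] = score
    return scores

# Hypothetical records, NOT the data from the lecture figures.
rows = [
    {"magazine": "yes", "watch": "no"},
    {"magazine": "yes", "watch": "yes"},
    {"magazine": "no", "watch": "no"},
    {"magazine": "yes", "watch": "yes"},
    {"magazine": "yes", "watch": "no"},
    {"magazine": "yes", "watch": "yes"},
]
labels = ["male", "female", "male", "male", "female", "female"]

scores = bayes_scores(rows, labels, {"magazine": "yes", "watch": "yes"})
print(max(scores, key=scores.get), scores)
```

Dividing each score by P(E) would give the actual probabilities, but since P(E) is the same for every class it does not change which class wins.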

Lecture 2

Figure 3: industry standard data mining framework

A model is built up from:
- Structure: variables, inputs, outputs and types of relations amongst them
- Parameters: free variables after a structure is selected
- Search method: method with which the optimal parameters are identified

Scoring functions (scoring function and how it is computed):

- Sum of squared errors: $SSE = \sum_{i=1}^{n} (t_i - x_i)^2$
- Mean absolute error: $MAE = \frac{1}{n} \sum_{i=1}^{n} |t_i - x_i|$
- Classification confusion matrix
- Estimated misclassification rate: err = (number of misclassified records) / (total number of records)
- Accuracy: accuracy = 1 - err
- Classification in unequal classes:
  - sensitivity: ability of a classifier to classify the members of C0 correctly
  - specificity: ability of a classifier to rule out the members of C1 correctly
  - false positive rate: percentage of C1 members erroneously classified in C0
  - false negative rate: percentage of C0 members erroneously classified in C1
- Mean squared error: $MSE = \frac{1}{n} \sum_{i=1}^{n} (t_i - x_i)^2$
- Root mean squared error: $RMSE = \sqrt{MSE}$
- Variance accounted for

Here the $t_i$ and $x_i$ are the n pairs of target and model output values.

Origin of errors:
- Experimental errors: inherent to the data, due to noise, the method of data collection, etc.
- Sample error: errors due to sampling the population
- Model error: error due to misfit of the selected model class
- Algorithmic error: error due to the inability of the algorithm to find the correct solution

Why not select the model that returns the least error? A balance is needed between the right model structure and the right parameters by using a single measure. Fitting the data does not mean that you fit the underlying function. There is a trade-off between flexibility and model performance.

Data split:

2/3 training set and 1/3 testing set; the two sets should be non-overlapping.
- Training set: used to determine the model parameters
- Testing set: used to estimate the model performance

Cross-validation:
- Split the data into multiple, non-overlapping subsets
- Use multiple estimates instead of a single estimate

K-fold cross-validation (e.g. k = 10): Divide the data into k sets. Use one set for testing and the remainder for training, and repeat this so that each set is used once for testing. The average of the k model errors is taken as the model error. The cross-validation may be repeated 10 times to reduce variance.

Leave-one-out cross-validation: Uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data.

How to choose one data mining method over another?
1. Estimate cross-validation results with the first method.
2. Estimate cross-validation results with the second method; make sure the same cross-validation data sets are used.
3. Apply a paired Student's t-test to determine whether the population means differ significantly.

Occam's razor: The best theory is the smallest one that describes all the facts.

Association rule: Given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items which are frequently handled together. Association rules model this information by means of IF-THEN rules. The goal is to identify rules that indicate a strong dependence between antecedent and consequent, which is measured by the confidence:

Confidence = (number of transactions containing both antecedent and consequent) / (number of transactions containing the antecedent)

Cardinality: a measure of the number of elements of a set. Cardinality = 1: IF white THEN red; cardinality = 2: IF white AND red THEN green.
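As a small illustration of the confidence measure, the sketch below counts transactions for two IF-THEN rules of cardinality 1 and 2. The five transactions and the helper names `support_count` and `confidence` are hypothetical and are not the lecture's example database.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    """Share of the transactions with the antecedent that also hold the consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0

# Hypothetical transactions (sets of purchased colors).
transactions = [
    {"red", "white", "green"},
    {"white", "orange"},
    {"white", "blue"},
    {"red", "white", "orange"},
    {"red", "blue"},
]

# Cardinality 1: IF white THEN red
print(confidence(transactions, {"white"}, {"red"}))
# Cardinality 2: IF white AND red THEN green
print(confidence(transactions, {"white", "red"}, {"green"}))
```

A rule whose antecedent never occurs gets confidence 0.0 here; that is a choice made for this sketch to avoid a division by zero.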

Naïve algorithm for association rules: Generates all the rules that would be candidates for indicating association between items, i.e. all possible combinations of items in a database with p distinct elements (in our example p = 6). The algorithm should find all combinations of single items, pairs of items, triplets of items and so on. In order to reduce the computation time, a good algorithm should generate only the combinations with a high frequency in the database: the frequent item sets.

Support of a rule: the number of transactions in a database that include both the antecedent and the consequent of the rule. Sometimes expressed as a percentage. For instance, the support for {red, white} is 4, or 100 x 4/10 = 40%.

Apriori algorithm: Initially generates the frequent item sets with just one item. Successively generates the frequent two-item sets, three-item sets and so on, discarding item sets which have a support below a desired minimum support. In general, generating the k-item sets uses the frequent (k-1)-item sets.

Figure 4: apriori algorithm

Lecture 3

(Artificial + Biological) Neural Network: Highly parallel models of the brain and nervous system that process information much more like the brain than like a serial computer. These models are adaptive systems and can change their structure during a learning phase. Artificial neural networks are suitable for classification and prediction, since they are good at extracting and recognizing patterns (the style) and at generalizing from the already seen to make predictions.

Biological neural nets (example: pigeons as art experts): Neural nets that use neurons and synapses for their internal structure.

Artificial neural nets: Use nodes and weights for their internal structure.

Feeding data through the net:

Each node computes a weighted sum of its inputs and applies a certain function to it. For a set of input values x1, x2, ..., xp, the output value at node j is

$g\left(\theta_j + \sum_{i=1}^{p} w_{ij} x_i\right)$

θj is called the bias of node j; it is a constant value that controls the level of contribution of node j. g is the so-called transfer function (e.g. a squashing function).

Figure 5: feeding data through the net with squashing function

Normalizing data: Neural networks perform best when predictors and response variables are on a scale of [0, 1].
- Numerical variable X in the range [a, b]: $X_{norm} = \frac{X - a}{b - a}$
- Categorical data: a choice of m fractions in [0, 1], e.g. [0, 0.25, 0.5, 1]

Training the model: Estimate the values θj and wij that lead to the best predictive results. Compute the neural network output for each row in the training set. The model produces a prediction, which is then compared with the actual response value. Their difference is the error for the output node. This error is used, iteratively, for estimating the weights and biases. One uses a hidden layer, which includes nodes, weights and biases. What is visible are the input and output values.

Back propagation of error: A method which updates the weights and bias values based on the error of the output, starting at the output layer of the network and working backwards. The updating stops when the new weights are only incrementally different from those of the preceding iteration, when the misclassification rate reaches a required threshold, or when the limit on the number of runs is reached. Each updating iteration is called an epoch.
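The node computation and the [0, 1] scaling described above can be sketched in a few lines. The logistic function is used here as one common choice of squashing function g; that choice, the function names and the numbers in the example are assumptions made for illustration.

```python
import math

def logistic(s):
    """A common squashing function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, bias, g=logistic):
    """Output of one node: g(bias + sum of weight * input)."""
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))

def min_max(x, a, b):
    """Rescale a numerical value from the range [a, b] to [0, 1]."""
    return (x - a) / (b - a)

# Hypothetical node with two inputs that are already scaled to [0, 1].
out = node_output(inputs=[0.2, 0.9], weights=[0.05, 0.01], bias=-0.3)
print(round(out, 2))

# A value of 15 on a scale from 10 to 20 lies halfway.
print(min_max(15, 10, 20))
```

Feeding data through a whole net repeats `node_output` layer by layer, with the outputs of one layer serving as the inputs of the next.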

The error of output node k is computed from the output of that node, and the weights and biases are then updated with a step that is scaled by the learning rate l. In the lecture example the weights are updated using a learning rate of 0.5, after which the error is propagated back through the net.

Lecture 4

Classification tree: A predictive model which maps observations about an item to conclusions about the item's target value. The goal is to create a model that predicts the value of a target variable based on several input variables, by presenting a graphical view of classification rules. Classification trees are simple on two levels: simple for the analyst and simple for the customers. The square terminal nodes are marked with 0 (non-acceptor) or 1 (acceptor). Each circle node represents a decision on a given predictor. Each path in the tree can be translated directly into a rule, for instance: IF (Income > 92.5) AND (Education < 1.5) AND (Family <= 2.5) THEN Class = 0

Recursive partitioning: For a given variable xi, a value si is chosen to split the p-dimensional space into two parts.

One part contains all the points with xi <= si and the other part contains the points with xi > si. The process continues until pure rectangles are obtained, i.e. rectangles that contain only points belonging to a single class. The ideal splitting value should reduce the impurity (heterogeneity) of the resulting rectangles. The impurity of a rectangle can be measured with an impurity index (such as the Gini index, $1 - \sum_k p_k^2$) or with the entropy measure, $-\sum_k p_k \log_2 p_k$, where $p_k$ is the proportion of points in the rectangle that belong to class k.

Pruning: A common strategy is to let the tree grow until each node contains a small number of instances, and then to use pruning to remove nodes that do not provide additional information.

Clustering: A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The results are used to gain insight into the data distribution or as a preprocessing step for other algorithms. Clustering has a very broad application base: it can handle real-valued attributes, binary attributes, nominal (categorical) attributes, ordinal/ranked attributes, and variables of mixed types. If all d dimensions are real-valued, we can visualize each data point as a point in a d-dimensional space. If all d dimensions are binary, we can think of each data point as a binary vector.

Partitional clustering methods:
- K-Means: Given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that

  $\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - c_i \rVert_2^2$

  is minimized. One way of solving the k-means problem:
  1. Randomly pick k cluster centers {c1, c2, ..., ck}.
  2. For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i.
  3. For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci).
  4. Repeat steps 2 and 3 until convergence.
- K-Median:

  Given a set X of n points in a d-dimensional space and an integer k, choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to form clusters {C1, C2, ..., Ck} such that the sum of the (unsquared) distances

  $\sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)$

  is minimized.
- K-Center/Partitioning around medoids: Choose randomly k medoids from the original dataset X. Assign each of the n-k remaining points to its closest medoid. Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost.

Distance function for binary vectors:
- Jaccard similarity: $JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
- Jaccard distance: $1 - JSim(X, Y)$

Distance functions for real-valued vectors:
- Lp norm, with p a positive integer: $L_p(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$
- If p = 1, then L1 is the Manhattan distance: $L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
- If p = 2, then L2 is the Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

Outliers: Objects that do not belong to any cluster or that form clusters of very small cardinality.

Hierarchical clustering: Produces a set of nested clusters organized as a hierarchical tree. Hierarchical clustering can be visualized as a dendrogram. No assumptions are made on the number of clusters: any number can be obtained by cutting the dendrogram at the proper level. There are two types of hierarchical clustering: agglomerative or divisive.

Agglomerative hierarchical clustering:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains

Dendrogram:

A tree-like diagram that records the sequence of merges or splits.

Figure 6: dendrogram

Distances:
- Single link distance between clusters Ci and Cj: The minimum distance between any object in Ci and any object in Cj. Drawback: sensitive to noise, produces long clusters.
- Complete link distance between clusters Ci and Cj: The maximum distance between any object in Ci and any object in Cj. Drawback: tends to break large clusters; all clusters tend to have the same diameter at first.
- Group average distance between clusters Ci and Cj: The average distance between any object in Ci and any object in Cj. Less susceptible to noise and outliers. Limitation: biased towards globular clusters.
- Centroid distance between clusters Ci and Cj: The distance between the centroid ri of Ci and the centroid rj of Cj.
- Ward's distance between clusters Ci and Cj: The difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters into cluster Cij. Less susceptible to noise and outliers, but biased towards globular clusters.

Divisive hierarchical clustering:
- Start with a single cluster composed of all data points
- Split this into components
- Continue recursively
- Monothetic divisive methods split clusters using one variable/dimension at a time
- Polythetic divisive methods make splits on the basis of all variables together
- Any inter-cluster distance measure can be used
- Drawback: computationally intensive, less widely used than agglomerative methods

Expectation Maximization Algorithm:

Initialize k distribution parameters (θ1, ..., θk); each distribution parameter corresponds to a cluster center. Iterate between two steps:
- Expectation step: (probabilistically) assign points to clusters.
- Maximization step: estimate model parameters that maximize the likelihood for the given assignment of points.

Process mining part

Lecture 5

Data mining (sometimes called data or knowledge discovery): the process of analyzing data from different perspectives and summarizing it into useful information, i.e. information that can be used to increase revenue, cut costs, or both.

Process mining: the extraction of non-trivial information from a registration of what happens during the execution of a process (a so-called event log). It gives information about what really happens within a process, not what the owners/managers think happens within the process (objective vs. subjective information). It can be applied for performance analysis, auditing/security, organizational models and process models.

Issues that may hamper the application of process mining:
- Event data is of low / poor / bad quality.
- Not everything is logged: some process steps are not recorded, or time information is missing or too coarse.
- A lot of effort is required to get / prepare the data.

Assumptions about event logs:
- A process consists of cases.
- A case consists of events such that each event relates to precisely one case.
- Events within a case are ordered.
- Events can have attributes. Examples of typical attribute names are activity, time, costs, and resource.

Flattening reality into event logs: In order to perform process mining, events need to be related to cases. However, in real life processes are not flat (see the order example in Section 4.4 of the book). In a hospital context

the case can be the treatments of a patient, the tasks of a nurse, etc. In a real process the flattened sub-processes are intertwined.

Alpha algorithm:
- Direct succession: x > y iff for some case x is directly followed by y.
- Causality: x → y iff x > y and not y > x.
- Parallel: x || y iff x > y and y > x.
- Choice: x # y iff not x > y and not y > x.

Ti = start activities, To = end activities, Tw = all activities (total). Yw contains the pairs of activity sets that define the places Pw; Fw contains the arcs (arrows).

Non-free-choice constructs cannot be discovered: only direct-following relations are used in the alpha miner. Challenges include: hidden tasks, duplicate tasks, non-free-choice constructs, loops, mining and exploiting time, dealing with noise, dealing with incompleteness.

General mining issues:
- Quality of the (learning) data
- Generalization: in data mining, not only a correct classification of the cases in the learning material is required, but also a correct classification of new cases. In process mining an unlimited number of traces can be possible, so generalization is difficult.
- Overfitting
- Noise
- Model representation bias
- Search technique bias

Quality of mined models:
- Parameter optimization using k-fold cross-validation + t-test
- Classification: % of correctly classified cases
- Estimation: mean squared error, $MSE = \frac{1}{n} \sum_{i=1}^{n} (t_i - x_i)^2$
- The performance measure is always computed on NEW material

Quality of the mined model:
- Fitness: is the observed behavior captured by the model?

- Precision: does the model only allow for behavior that happens in reality?
- Generalization: does the model allow for more behavior than encountered in reality?
- Structure/Simplicity: does the model have a minimal structure to describe the behavior (easy-to-understand models)?

Quality of a data mining model:
- One measure
- Measures independent of the mining technique
- Benchmark datasets
- Parameter optimization

Quality of a process mining model:
- Combination of 3 measures (generalization is missing)
- Strongly connected with the Petri net formalism (theoretically strong, practically ...)
- Many event logs, but no clear benchmark set
- Parameter optimization unclear

Lecture 6

Sometimes one has to model very extensive and large processes, and the resulting models will look like spaghetti. In situations with low-structured domains (for instance health care), noise and low-frequency behavior, one has to use different heuristics that are more flexible, focus on the main behavior and take the frequency of the behavior in the event log into account. Examples are the Flexible Heuristics Miner, the Genetic Miner and the Fuzzy Miner.

Flexible Heuristics Miner:

Five different types of noise-generating operations: (i) delete the head of a trace, (ii) delete the tail of a trace, (iii) delete a part of the body, (iv) remove one event, and finally (v) interchange two randomly chosen events.

Fuzzy miner: Roadmap principle in process mining:
- Emphasis: significant information is highlighted by visual means
- Customization: the level of detail is adapted to the purpose
- Abstraction: low-level information is omitted
- Aggregation: clusters of low-level detail information
It uses internal filters: concurrency filter, edge filter, node filter.

Lecture 7

CRISP-DM: Cross-Industry Standard Process for Data Mining:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Data mining overview:
- The goal is the mining of a prediction model
- The model is employed in the business process

Process mining overview:
- The goal is a better understanding of the business process
- The knowledge is used to improve the process

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

More information

Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning. Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Classification algorithm in Data mining: An Overview
