# Personalized Hierarchical Clustering

Save this PDF as:

Size: px
Start display at page:

## Transcription

5 3.3. Termination of the Algorithm The most natural stopping criterion for the weight adaptation would be a correct clustering of the known documents, i.e. a clustering that does not violate any constraints. However, such a solution might not exist. Therefore, we need another criterion. Most importantly, we need a measure to evaluate the quality of the current clustering to be able to decide whether the algorithm is still improving. For this purpose, we define the cluster error CE. For each document d with a given constraint (d, (S 1,..., S m )) = (d, S), we can derive from the dendrogram of the current clustering the order, in which documents are merged to d, producing another ordered list of document sets R = (R 1,..., R r ). To compute CE, we count how often the order in R violates the order in S, i.e. the order of two documents d i and d j is reversed. In addition, violations between classes that are further apart in the hierarchy are more severe. Therefore, constraint violations should be weighted. The distance between two classes is reflected in the distance between two constraint sets in the ordered list and can therefore be utilized for weighting. The overall cluster error CE can therefore be defined as: CE = (d,s,r) (d ik,x,d jl,y ) k<l { x y if x > y 0 else }, (5) where d ik,x R k, d ik,x S x, d jl,y R l and d jl,y S y. The cluster error can directly be used to define a stopping criterion for the algorithm. If the cluster error could not be decreased in a certain number of iterations, the algorithm is supposed to not find any better solution any more and stops. The complete algorithm for learning the weights is summarized in Fig Evaluation We evaluated BiHAC with a part of the banksearch dataset [12], consisting of 450 web pages in a two-level hierarchy. We used the classes and number of documents as shown in Fig. 3. After parsing the documents, we applied standard document preprocessing methods, e.g. filtered out terms that occurred in less than five documents, in almost all documents (> D 5), stop words, terms with less than four characters, and terms containing numbers. Each document that had no terms left after these preprocessing steps was ignored. In our dataset, this was true for two documents, one from the Insurance Agencies class and one from the Sport class. Finally, we selected out of the remaining terms the best 4000 as described in [3]. The method uses an entropy measure and ensures an about equal coverage of the documents by the selected terms. Term selection was done to increase clustering speed. weightadapt(mlb, D) Initialize w: w i : w i := 1 Do: Initialize clustering with D w (a) := 0 While not all clusters are merged: Merge two closest clusters c 1 and c 2 If MLB is violated by merge: Determine the most similar clusters c (s) 1 and c (s) 2 to c 1 and c 2, respectively, that can be merged according to MLB i : w (a) i := w (a) i + (c 1,i c (s) 1,i c 1,i c 2,i ) i : w (a) i := w (a) i + (c 2,i c (s) 2,i c 1,i c 2,i ) Compute current cluster error CE Adapt w := w + η w (a) i : if w i < 0 then w i := 0 Normalize w such that i w i = n holds Until w (a) = 0 or CE did not improve for r runs Return w Figure 2. The weight adaptation of BiHAC Finance (0) Commercial Banks (50) Building Societies (50) Insurance Agencies (50) Sport (50) Soccer (50) Motor Sport (50) Programming (0) C/C++ (50) Java (50) Visual Basic (50) Figure 3. Class Structure of our Training Data 4.1. Evaluation Procedure It is difficult to find a good evaluation strategy to measure the quality of cluster personalization. Usually, user studies are necessary, however they are very time consuming and not in scope of this paper. Furthermore, we deal with a hierarchical class structure in the data. And third, our algorithm provides so far a very fine-grained structure by the dendrogram giving the complete nested dichotomy of the data. Although this could be displayed to the user, a more coarse structure would improve navigation. Therefore, we decided on a two-fold evaluation procedure. First, we measure how well the hierarchical structure of the data was learned. Second, we measure the classification quality that can be gained, independently of the hierarchical relations between the classes. To compare the learned cluster structure with the given class structure, we used the MLB constraints in a similar way as for the cluster error presented in (5). However, for

6 evaluation purposes, the relations between all documents in D are defined by MLB constraints. Then we can use the cluster error CE as a measure of how good the user structure was learned. A value of 0 denotes a perfect match. For measuring classification quality, we need to assign labels to unlabeled documents. A common strategy in semisupervised clustering is to label unlabeled documents in the same cluster of a labeled item with this label. However, our algorithm does not actually extract distinct clusters. We therefore follow a simple strategy. From the dendrogram, we extract the largest possible clusters that only contain items with the same class label. For super-classes, we consider each document directly assigned to this class or assigned to a subclass as labeled with this class. Note that extracted super-clusters therefore contain sub-clusters, which is in correlation with the given class structure. A very simple example is given in Fig. 4. The underlined numbers are the labeled data items available. Figure 4. From Dendrogram to Classes (Ex.) With this derived class structure, we evaluate the classification quality of the algorithm with a standard evaluation measure from classification, the F-score (as a combination of precision and recall). The precision of a class c is the fraction of all documents retrieved for this class that are correctly retrieved, while the recall of c is the fraction of all documents belonging to this class that are actually retrieved. The F-Score combines both measures to describe the found trade-off between the two (see e.g. [6]). Please note that these measures are computed on the unlabeled data only as the labeled data is correctly labeled by default. This also allows us to compare our approach to a standard classifier as a baseline. Here, we trained a support vector machine (SVM; based on the libsvm implementation [4]). As mentioned above, the problem itself cannot be handled by a classifier in general. However, in this evaluation, we deal with a special case, where C u is empty. Therefore, a classifier can be applied for comparison. When computing precision and recall for higher level classes in the hierarchy, we decided to evaluate these classes as a whole, i.e. all elements from sub classes are handled as if they belong to this class. This helps to compare, how the performance changes on different hierarchy levels. To evaluate BiHAC, we used different numbers of labeled documents per class: zero (which leads to the standard hierarchical agglomerative clustering), one, five, ten, and twenty. With these selected documents, we formed the MLB constraints used by the algorithm for learning the weights. Two results of each evaluation run were captured. One after a fixed number of five runs (named with BiHAC- 5) and one with maximally minimized cluster error on the training data (i.e. the labeled data) Results Figure 5 shows the cluster error of the complete data. More specific, the percentage of violated constraints is presented. As can be seen, our approach is capable of adapting towards a user specific class structure. Already a single labeled document per class reduces the cluster error CE from 40% to 31%. More labeled documents can decrease the cluster error even further, although improvements get less because many important terms occur in several documents. This means a new document also provides information that is already known. Interestingly, large improvements can already be gained after five iterations of the algorithm. The more training data available, the more similar are the gained improvements in cluster error. Moreover note that the minimum of the cluster error is most likely not zero as there are conflicting constraints, either introduced by noise or due to the fact that with creating the document vectors valuable information about the documents got lost. Figure 5. Cluster Error Figure 6 presents the F-score. It is a class specific measure. In order to get an overall evaluation of the performance, we calculate the mean over all classes. Since we have a hierarchical setting here, we calculate two different mean values, one for each hierarchy level. On the first level, we only distinguish between the three classes Banking, Programming, and Sport. Documents labeled with a subclass are counted to be labeled with its parent class. The second hierarchy level consists of the eight subclasses. Both algorithms show a similar behavior. The F-score is increasing with increasing amount of labeled data on both hierarchy levels. The performance of both algorithms is higher on level 1. That means, both algorithms make most mistakes between classes that describe subtopics but can separate the main classes quite well. This is also expected as distinguishing between top-level classes is easier then between more specific classes. The results obtained after only

7 that extracts a more coarse-grained cluster structure from the fine-grained dendrogram containing meaningful clusters. This is especially important for result visualization as a dendrogram requires too much navigation until a sought document is found. Furthermore, we want to explore the integration of more than one given hierarchy, e.g. from different users, to further guide the clustering process. References Figure 6. Mean F-Score five iterations of BiHAC are comparable to SVM or even better. However, optimal adaptation to the user specific constraints leads to a decrease of performance. This is probably due to overfitting to the training data. We will tackle this problem in future work. It is interesting to mention that our algorithm clearly outperforms SVM on the first hierarchy level, when only a single labeled document is available. Our evaluation has shown two main points. First, our algorithm is actually able to adapt towards a user specific structure. And second, classification performance gained by a very simple labeling strategy provides results comparable or even better to state of the art classifiers as SVM. Furthermore, we like to mention two more aspects, which were not further evaluated. On the one hand, the algorithm provides further structuring capabilities. This includes refinement of user classes or the exploration of new hidden classes. On the other hand, the personalization of the clustering is gained by learning term weights. The final clustering as presented to the user is influenced only by using these weights in the similarity measure. The clustering process itself is not changed and thus the obtained weighting scheme could also be stored in a user profile for further use, e.g. to describe a user s interests or to be applied to other different classification or clustering methods. To evaluate these aspects is subject to future work. 5. Conclusion and Future Work In this paper, we presented an approach for user-specific hierarchical clustering called BiHAC. As a basis, we extended the definitions of constraint based clustering to express hierarchical constraints. BiHAC proved to be able to adapt to a user structure by learning term weights in an iterative process. This adaptation as well as classification performance were evaluated using a benchmark dataset. The results encourage further work in this direction. Besides the issues already mentioned throughout the paper, we want to add a post-processing step to our clustering algorithm [1] S. Basu, A. Banerjee, and R. Mooney. Active semisupervision for pairwise constrained clustering. In Proc. of the 4 th SIAM International Conf. on Data Mining, [2] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. of the 7 th Intern. Conf. on Database Theory, pages , [3] C. Borgelt and A. Nürnberger. Fast fuzzy clustering of web page collections. In Proc. of PKDD Workshop on Statistical Approaches for Web Mining (SAWM), [4] C. Chang and C. Lin. LIBSVM: a library for support vector machines, Software available at cjlin/libsvm. [5] D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR , Cornell University, [6] A. Hotho, A. Nürnberger, and G. Paaß. A brief survey of text mining. GLDV-Journal for Computational Linguistics and Language Technology, 20(1):19 62, [7] H. Kim and S. Lee. An effective document clustering method using user-adaptable distance metrics. In Proceedings of the 2002 ACM symposium on Applied computing, pages 16 20, New York, NY, USA, ACM Press. [8] D. Klein, S. Kamvar, and C. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proc. of the 19 th International Conf. on Machine Learning, pages , [9] A. Nürnberger and M. Detyniecki. Weighted self-organizing maps - incorporating user feedback. In Proc. of the joined 13 th International Conference on Artificial Neural Networks and Neural Information Processing, pages , [10] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5): , [11] G. Salton, A. Wong, and C. Yang. A vector space model for information retrieval. Journal of the American Society for Information Science, 18(11): , [12] M. Sinka and D. Corne. A large benchmark dataset for web document clustering. In Soft Computing Systems: Design, Management and Applications, Vol. 87 of Frontiers in Artificial Intelligence and Applications, pages , [13] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proc. of 18 th Intern. Conf. on Machine Learning, [14] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with sideinformation. In Advances in Neural Information Processing Systems 15, pages

### Learning a Metric during Hierarchical Clustering based on Constraints

Learning a Metric during Hierarchical Clustering based on Constraints Korinna Bade and Andreas Nürnberger Otto-von-Guericke-University Magdeburg, Faculty of Computer Science, D-39106, Magdeburg, Germany

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

### Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

### ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

### . Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

### Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Hierarchical Classification by Expected Utility Maximization

Hierarchical Classification by Expected Utility Maximization Korinna Bade, Eyke Hüllermeier, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

### UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

### Text Clustering. Clustering

Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover

### Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

### A Lightweight Solution to the Educational Data Mining Challenge

A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

### Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

### CLUSTERING FOR FORENSIC ANALYSIS

IMPACT: International Journal of Research in Engineering & Technology (IMPACT: IJRET) ISSN(E): 2321-8843; ISSN(P): 2347-4599 Vol. 2, Issue 4, Apr 2014, 129-136 Impact Journals CLUSTERING FOR FORENSIC ANALYSIS

### Machine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand

### Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

### Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

### Web Document Clustering

Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

### Machine Learning for NLP

Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability

### Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

### Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

### Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

### Data Mining Techniques Chapter 11: Automatic Cluster Detection

Data Mining Techniques Chapter 11: Automatic Cluster Detection Clustering............................................................. 2 k-means Clustering......................................................

### Self Organizing Maps for Visualization of Categories

Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, julian.szymanski@eti.pg.gda.pl

### Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

### EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING

EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING Navjot Kaur, Jaspreet Kaur Sahiwal, Navneet Kaur Lovely Professional University Phagwara- Punjab Abstract Clustering is an essential

### Resource-bounded Fraud Detection

Resource-bounded Fraud Detection Luis Torgo LIAAD-INESC Porto LA / FEP, University of Porto R. de Ceuta, 118, 6., 4050-190 Porto, Portugal ltorgo@liaad.up.pt http://www.liaad.up.pt/~ltorgo Abstract. This

### Generating Diverse Clusterings

Generating Diverse Clusterings Anonymous Author(s) Abstract Traditional clustering algorithms search for an optimal clustering based on a pre-specified clustering criterion, e.g. the squared error distortion

### Clustering & Association

Clustering - Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Cluster Analysis: Basic Concepts and Algorithms

Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Clustering Documents with Active Learning using Wikipedia

Clustering Documents with Active Learning using Wikipedia Anna Huang David Milne Eibe Frank Ian H. Witten Department of Computer Science, University of Waikato Private Bag 3105, Hamilton, New Zealand {lh92,

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

### Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

### Dr. Antony Selvadoss Thanamani, Head & Associate Professor, Department of Computer Science, NGM College, Pollachi, India.

Enhanced Approach on Web Page Classification Using Machine Learning Technique S.Gowri Shanthi Research Scholar, Department of Computer Science, NGM College, Pollachi, India. Dr. Antony Selvadoss Thanamani,

### Data Mining and Clustering Techniques

DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

382 [7] Reznik, A, Kussul, N., Sokolov, A.: Identification of user activity using neural networks. Cybernetics and computer techniques, vol. 123 (1999) 70 79. (in Russian) [8] Kussul, N., et al. : Multi-Agent

### An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

### There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are

### Study of Euclidean and Manhattan Distance Metrics using Simple K-Means Clustering

Study of and Distance Metrics using Simple K-Means Clustering Deepak Sinwar #1, Rahul Kaushik * # Assistant Professor, * M.Tech Scholar Department of Computer Science & Engineering BRCM College of Engineering

### The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

### IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES

INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND SCIENCE IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES C.Priyanka 1, T.Giri Babu 2 1 M.Tech Student, Dept of CSE, Malla Reddy Engineering

### K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

### Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501

CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### A Survey of Document Clustering Techniques & Comparison of LDA and movmf

A Survey of Document Clustering Techniques & Comparison of LDA and movmf Yu Xiao December 10, 2010 Abstract This course project is mainly an overview of some widely used document clustering techniques.

### Semi-Supervised Learning for Blog Classification

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

### Performance Analysis of Clustering using Partitioning and Hierarchical Clustering Techniques. Karunya University, Coimbatore, India. India.

Vol.7, No.6 (2014), pp.233-240 http://dx.doi.org/10.14257/ijdta.2014.7.6.21 Performance Analysis of Clustering using Partitioning and Hierarchical Clustering Techniques S. C. Punitha 1, P. Ranjith Jeba

Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### An Enhanced Clustering Algorithm to Analyze Spatial Data

International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data K. Mumtaz 1 Assistant Professor, Department of MCA Vivekanandha Institute of Information and Management Studies, Tiruchengode,

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

### Cluster Analysis: Basic Concepts and Algorithms

8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

### Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

### PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

### SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

### Component visualization methods for large legacy software in C/C++

Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

### Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

### DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,