Text Clustering

Clustering
Partition unlabeled examples into disjoint subsets called clusters, such that:
- Examples within a cluster are very similar.
- Examples in different clusters are very different.
Clustering discovers new categories in an unsupervised manner (no sample category labels are provided).

Clustering Example
[Figure: example data points grouped into clusters.]

Hierarchical Clustering
Build a tree-based hierarchical taxonomy (a dendrogram) from a set of unlabeled examples.
[Figure: example taxonomy — animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]
Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering
Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine clusters to form larger and larger clusters.
Divisive (partitional, top-down) methods separate all examples into clusters immediately.

Direct Clustering Method
Direct clustering methods require a specification of the desired number of clusters, k.
A clustering evaluation function assigns a real-valued quality measure to a clustering.
The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.
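For illustration, here is a minimal sketch of that model-selection loop, assuming scikit-learn's KMeans as the direct clustering method and the silhouette coefficient as the evaluation function; the data matrix X and the range of k are placeholders.

```python
# Sketch: generate clusterings for several values of k and keep the one that
# scores best under a clustering evaluation function (here: silhouette).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 5)             # placeholder data matrix

best_k, best_score = None, -1.0
for k in range(2, 10):                  # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)     # real-valued quality measure
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k} (silhouette = {best_score:.3f})")
```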

Hierarchical Agglomerative Clustering (HAC)
Assumes a similarity function for determining the similarity of two instances.
Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.
The history of merging forms a binary tree or hierarchy.

HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
- Among the current clusters, determine the two clusters, $c_i$ and $c_j$, that are most similar.
- Replace $c_i$ and $c_j$ with a single cluster $c_i \cup c_j$.
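A minimal sketch of this loop, assuming cosine similarity between instances and single-link cluster similarity (defined on the next slides); the function names are mine and the quadratic pair search is kept naive for clarity.

```python
# Sketch of the HAC algorithm: start with each instance in its own cluster,
# then repeatedly merge the two most similar clusters until one remains.
import numpy as np

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def single_link_sim(ci, cj, X):
    # similarity of the two most similar members of c_i and c_j
    return max(cosine_sim(X[a], X[b]) for a in ci for b in cj)

def hac(X):
    clusters = [[i] for i in range(len(X))]   # each instance in its own cluster
    merges = []                               # merge history = the binary tree
    while len(clusters) > 1:
        # find the pair of current clusters that is most similar
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: single_link_sim(clusters[p[0]], clusters[p[1]], X))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # replace c_i and c_j with c_i ∪ c_j
        del clusters[j]                           # j > i, so index i stays valid
    return merges
```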

Cluster Similarity
Assume a similarity function that determines the similarity of two instances, sim(x, y), e.g. the cosine similarity of document vectors.
How do we compute the similarity of two clusters, each possibly containing multiple instances?
- Single link: similarity of the two most similar members.
- Complete link: similarity of the two least similar members.
- Group average: average similarity between members.

Single Link Agglomerative Clustering
Use the maximum similarity over pairs:
$$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$
Can result in straggly (long and thin) clusters due to a chaining effect.
Appropriate in some domains, such as clustering islands.

Single Link Example
[Figure: single-link clustering example.]

Complete Link Agglomerative Clustering
Use the minimum similarity over pairs:
$$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$
Makes tighter, more spherical clusters, which are typically preferable.
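Written directly from the two formulas above (a sketch; sim is any instance-level similarity, such as the cosine similarity of document vectors):

```python
# Sketch: single-link and complete-link cluster similarity.
def single_link(ci, cj, sim):
    # sim(c_i, c_j) = max over x in c_i, y in c_j of sim(x, y)
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim):
    # sim(c_i, c_j) = min over x in c_i, y in c_j of sim(x, y)
    return min(sim(x, y) for x in ci for y in cj)
```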

Complete Link Example
[Figure: complete-link clustering example.]

Computing Cluster Similarity
After merging $c_i$ and $c_j$, the similarity of the resulting cluster to any other cluster, $c_k$, can be computed by:
- Single link: $\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$
- Complete link: $\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$
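These update rules let HAC maintain a cluster-similarity matrix instead of rescanning instance pairs after every merge; a hedged sketch, where S is assumed to be a symmetric matrix of current cluster similarities:

```python
# Sketch: similarity of the merged cluster (c_i ∪ c_j) to every other cluster,
# computed from the stored similarities of c_i and c_j.
import numpy as np

def merged_similarities(S, i, j, linkage="single"):
    combine = np.maximum if linkage == "single" else np.minimum  # complete link: min
    return combine(S[i], S[j])   # entry k is sim((c_i ∪ c_j), c_k)
```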

Group Average Agglomerative Clustering
Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in (c_i \cup c_j)} \;\sum_{\vec{y} \in (c_i \cup c_j):\, \vec{y} \neq \vec{x}} \mathrm{sim}(\vec{x}, \vec{y})$$
A compromise between single and complete link.

Non-Hierarchical Clustering
Typically must provide the number of desired clusters, k.
Randomly choose k instances as seeds, one per cluster.
Form initial clusters based on these seeds.
Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
Stop when the clustering converges or after a fixed number of iterations.
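A direct transcription of the group-average formula (a sketch; X is assumed to hold the instance vectors as rows, and ci, cj are lists of row indices):

```python
# Sketch: group-average similarity, averaged over all ordered pairs of
# distinct instances in the merged cluster c_i ∪ c_j.
def group_average_sim(ci, cj, X, sim):
    merged = list(ci) + list(cj)
    n = len(merged)
    total = sum(sim(X[a], X[b]) for a in merged for b in merged if a != b)
    return total / (n * (n - 1))
```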

K-Means
Assumes instances are real-valued vectors.
Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics
Euclidean distance ($L_2$ norm):
$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
$L_1$ norm:
$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$
Cosine similarity (transformed into a distance by subtracting from 1):
$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$
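A short numpy sketch of the centroid and the three measures above (function names are mine):

```python
# Sketch: centroid and distance measures from the slides.
import numpy as np

def centroid(cluster):
    # mu(c) = (1/|c|) * sum of the vectors in c; `cluster` is an (n, m) array
    return cluster.mean(axis=0)

def l2_distance(x, y):
    # Euclidean (L2) distance
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    # L1 norm
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 minus cosine similarity
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```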

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances $\{s_1, s_2, \ldots, s_k\}$ as seeds.
Until the clustering converges or another stopping criterion is met:
- For each instance $x_i$: assign $x_i$ to the cluster $c_j$ such that $d(x_i, s_j)$ is minimal.
- Update the seeds to the centroid of each cluster: for each cluster $c_j$, set $s_j = \mu(c_j)$.

K-Means Example (K = 2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]
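A minimal sketch of the loop above, with the distance measure d passed in as a parameter (e.g. one of the functions from the previous sketch); assumes X is a float numpy array.

```python
# Sketch of the K-means algorithm: pick k random instances as seeds, assign
# each instance to the nearest seed, recompute centroids, and repeat until
# the assignment stops changing (or a fixed number of iterations is reached).
import numpy as np

def kmeans(X, k, d, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial seeds
    assignment = None
    for _ in range(max_iter):
        # assign each x_i to the cluster c_j with minimal d(x_i, s_j)
        new_assignment = np.array([np.argmin([d(x, s) for s in centroids]) for x in X])
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                              # converged
        assignment = new_assignment
        # update each seed to the centroid of its cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids
```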

Seed Choice
Results can vary based on random seed selection.
Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
Select good seeds using a heuristic or the results of another method.

Text Clustering
HAC and K-means have been applied to text in a straightforward way.
Typically use normalized, TF-IDF-weighted vectors and cosine similarity.
Optimize computations for sparse vectors.
Applications:
- During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
- Cluster the results of retrieval to present more organized results to the user (à la Northernlight folders).
- Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo and DMOZ).
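As an illustration of that recipe (not from the slides), a scikit-learn sketch: TfidfVectorizer produces L2-normalized sparse TF-IDF vectors by default, so running K-means on them approximates cosine-based clustering; the documents are placeholders.

```python
# Sketch: cluster a few documents using TF-IDF-weighted vectors and K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market fell sharply today",
    "investors reacted to the falling market",
    "the team won the championship game",
    "fans celebrated the game-winning goal",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse, L2-normalized TF-IDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster assignment per document, e.g. [0 0 1 1]
```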

Soft Clustering
Clustering typically assumes that each instance is given a hard assignment to exactly one cluster.
This does not allow uncertainty in class membership, or for an instance to belong to more than one cluster.
Soft clustering gives the probability that an instance belongs to each of a set of clusters.
Each instance is assigned a probability distribution across the set of discovered categories (the probabilities of all categories must sum to 1).

Exercise
Cluster the following documents using K-means with K = 2 and cosine similarity.
- Doc1: go monster go
- Doc2: go karting
- Doc3: karting monster
- Doc4: monster monster
Assume Doc1 and Doc3 are chosen as the initial seeds. Use tf weights (no idf). Show the clusters and their centroids for each iteration. The algorithm should converge after 2 iterations.
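To check a hand computation of the exercise, here is a small sketch: it builds the term-frequency vectors (the term order go, karting, monster is my choice), seeds with Doc1 and Doc3, and prints the clusters and centroids at each iteration of cosine-similarity K-means.

```python
# Sketch: set up the exercise and iterate K-means with cosine similarity,
# printing clusters and centroids so each iteration can be checked by hand.
import numpy as np

docs = {"Doc1": [2, 0, 1],   # go monster go
        "Doc2": [1, 1, 0],   # go karting
        "Doc3": [0, 1, 1],   # karting monster
        "Doc4": [0, 0, 2]}   # monster monster
names = list(docs)
X = np.array([docs[n] for n in names], dtype=float)

def cos(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

centroids = np.array([docs["Doc1"], docs["Doc3"]], dtype=float)  # initial seeds
for it in range(1, 10):
    # assign each document to the centroid with the highest cosine similarity
    assign = [int(np.argmax([cos(x, c) for c in centroids])) for x in X]
    clusters = {k: [names[i] for i, a in enumerate(assign) if a == k] for k in (0, 1)}
    new_centroids = np.array([X[[a == k for a in assign]].mean(axis=0) for k in (0, 1)])
    print(f"iteration {it}: {clusters}  centroids: {new_centroids.tolist()}")
    if np.allclose(new_centroids, centroids):
        break                # assignments and centroids no longer change
    centroids = new_centroids
```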