Clustering Connectionist and Statistical Language Processing




Clustering
Connectionist and Statistical Language Processing
Frank Keller keller@coli.uni-sb.de
Computerlinguistik, Universität des Saarlandes
Clustering p.1/21

Overview
- clustering vs. classification
- supervised vs. unsupervised learning
- types of clustering algorithms
- the k-means algorithm
- applying clustering to word similarity
- evaluating clustering models

Literature: Witten and Frank (2000: ch. 6), Manning and Schütze (1999: ch. 14)

Clustering vs. Classification
- Classification: the task is to learn to assign instances to predefined classes.
- Clustering: no predefined classification is required; the task is to learn a classification from the data.
- Clustering algorithms divide a data set into natural groups (clusters). Instances in the same cluster are similar to each other: they share certain properties.

Supervised vs. Unsupervised Learning
- Supervised learning: classification requires supervised learning, i.e., the training data has to specify what we are trying to learn (the classes).
- Unsupervised learning: clustering is an unsupervised task, i.e., the training data doesn't specify what we are trying to learn (the clusters).

Example: to learn parts of speech in a supervised fashion, we need a predefined inventory of POS (a tagset) and an annotated corpus. To learn POS in an unsupervised fashion, we only need an unannotated corpus; the task is to discover a set of POS.

Clustering Algorithms
Clustering algorithms can have different properties:
- Hierarchical or flat: hierarchical algorithms induce a hierarchy of clusters of decreasing generality; for flat algorithms, all clusters have the same status.
- Iterative: the algorithm starts with an initial set of clusters and improves them by reassigning instances to clusters.
- Hard or soft: hard clustering assigns each instance to exactly one cluster; soft clustering assigns each instance a probability of belonging to each cluster.
- Disjunctive: instances can be part of more than one cluster.

Examples
[Figure: two example clusterings of instances a-k. Left: hard, non-hierarchical, non-disjunctive. Right: hard, non-hierarchical, disjunctive.]

Examples
[Figure: a hard, hierarchical, non-disjunctive clustering of instances a-k.]

A soft, non-hierarchical, disjunctive clustering, given as membership probabilities for three clusters:

      1    2    3
a   0.4  0.1  0.5
b   0.1  0.8  0.1
c   0.3  0.3  0.4
d   0.1  0.1  0.8
e   0.4  0.2  0.4
f   0.1  0.4  0.5
g   0.7  0.2  0.1
h   0.5  0.4  0.1

Using Clustering
Clustering can be used for:
- Exploratory data analysis: visualize the data at hand, get a feeling for what the data look like and what their properties are. A first step in building a model.
- Generalization: discover instances that are similar to each other, and hence can be handled in the same way.

Example for generalization: Thursday and Friday co-occur with the preposition on in our data, but Monday doesn't. If we know that Thursday, Friday, and Monday are syntactically similar, then we can infer that they all co-occur with on.

Properties of Clustering Algorithms
Hierarchical clustering:
- Preferable for detailed data analysis; provides more information than flat clustering.
- No single best algorithm exists (different algorithms are optimal for different applications).
- Less efficient than flat clustering.

Properties of Clustering Algorithms
Flat clustering:
- Preferable if efficiency is important or for large data sets.
- k-means is conceptually the simplest method and should be tried first on new data; its results are often sufficient.
- k-means assumes a Euclidean representation space and is therefore inappropriate for nominal data.
- Alternative: the EM algorithm, which defines clusters in terms of probabilistic models.

The k-means Algorithm
An iterative, hard, flat clustering algorithm based on Euclidean distance. Intuitive formulation:
1. Specify k, the number of clusters to be generated.
2. Choose k points at random as cluster centers.
3. Assign each instance to its closest cluster center using Euclidean distance.
4. Calculate the centroid (mean) of each cluster and use it as the new cluster center.
5. Reassign all instances to the closest cluster center.
6. Iterate until the cluster centers don't change any more.
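The steps above can be sketched as a minimal Python implementation (an illustration, not code from the lecture; the helper names and the toy data in the usage example are invented):

```python
import random

def dist2(x, y):
    """Squared Euclidean distance (has the same minimizer as the distance)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def centroid(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(points, k, seed=0):
    """Minimal k-means: hard, flat, iterative, Euclidean distance."""
    random.seed(seed)
    centers = random.sample(points, k)            # choose k points at random
    while True:
        # assign each instance to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        # recompute each center as the centroid (mean) of its cluster
        new_centers = [centroid(cl) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                # centers stable: done
            return centers, clusters
        centers = new_centers
```

For example, `kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)], 2)` separates the two obvious groups, with centers at their means (4/3, 4/3) and (25/3, 25/3).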

The k-means Algorithm
Each instance x in the training set can be represented as a vector of n values, one for each attribute:

(1)  x = (x_1, x_2, ..., x_n)

The Euclidean distance of two vectors x and y is defined as:

(2)  |x - y| = sqrt(sum_{i=1}^{n} (x_i - y_i)^2)

The mean µ of the set of vectors in a cluster c_j is defined as:

(3)  µ = (1/|c_j|) * sum_{x in c_j} x
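Definitions (2) and (3) translate directly into code (a sketch; function names are my own):

```python
from math import sqrt

def euclidean(x, y):
    """Definition (2): |x - y| = sqrt(sum_i (x_i - y_i)^2)."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def mean(vectors):
    """Definition (3): the component-wise mean of the vectors in a cluster."""
    n = len(vectors)
    return tuple(sum(coords) / n for coords in zip(*vectors))
```

For instance, `euclidean((0, 0), (3, 4))` is 5.0, and `mean([(0, 0), (2, 4)])` is (1.0, 2.0).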

The k-means Algorithm
Example:
[Figure: two plots on an 8-by-8 grid. Left: assign instances (crosses) to the closest cluster centers (circles) according to Euclidean distance. Right: recompute the cluster centers as the means of the instances in each cluster.]

The k-means Algorithm
Properties of the algorithm:
- It only finds a local optimum, not a global one.
- The clusters it comes up with depend heavily on which random cluster centers are chosen initially.
- It can be used for hierarchical clustering: first apply k-means with k = 2, yielding two clusters; then apply it again to each of the two clusters, and so on.
- Distance metrics other than Euclidean distance can be used, e.g., the cosine.

Applications of Clustering
Task: discover contextually similar words.
- First determine a set of context words, e.g., the 500 most frequent words in the corpus.
- Then determine a window size, e.g., 10 words (5 to the left and 5 to the right).
- For each word in the corpus, compute a co-occurrence vector that specifies how often the word co-occurs with the context words in the given window.
- Use clustering to find words with similar context vectors.

This can find words that are syntactically or semantically similar, depending on the parameters (context words, window size).
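The co-occurrence vectors described above can be computed as follows (a sketch; the tiny corpus and parameters in the usage example are invented, not from the lecture):

```python
def cooccurrence_vectors(tokens, context_words, window=5):
    """For each word in `tokens`, count how often each context word occurs
    within `window` tokens to the left or right of it."""
    index = {w: i for i, w in enumerate(context_words)}
    vectors = {}
    for pos, word in enumerate(tokens):
        # one vector per word type, accumulated over all its occurrences
        vec = vectors.setdefault(word, [0] * len(context_words))
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue
            ctx = tokens[ctx_pos]
            if ctx in index:
                vec[index[ctx]] += 1
    return vectors
```

For example, with `tokens = "the cat sat on the mat".split()`, context words `["the", "on"]`, and `window=2`, the vector for "cat" is [1, 1] and the vector for "sat" is [2, 1].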

Applications of Clustering
Example (Manning and Schütze 1999: ch. 8): use adjectives (rows) as context words to cluster nouns (columns) according to semantic similarity.

              cosmonaut  astronaut  moon  car  truck
Soviet            1          0       0     1     1
American          0          1       0     1     1
spacewalking      1          1       0     0     0
red               0          0       0     1     1
full              0          0       1     0     0
old               0          0       0     1     1

Cosmonaut and astronaut are similar, as are car and truck.
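The similarity claim can be checked numerically using the cosine measure mentioned earlier as an alternative to Euclidean distance (a sketch; the lecture's own comparison is informal):

```python
from math import sqrt

# Each noun is represented by its column of the table above,
# in the row order Soviet, American, spacewalking, red, full, old.
nouns = {
    "cosmonaut": (1, 0, 1, 0, 0, 0),
    "astronaut": (0, 1, 1, 0, 0, 0),
    "moon":      (0, 0, 0, 0, 1, 0),
    "car":       (1, 1, 0, 1, 0, 1),
    "truck":     (1, 1, 0, 1, 0, 1),
}

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return dot / norms
```

Car and truck have identical vectors (cosine 1.0), cosmonaut and astronaut share the spacewalking dimension (cosine 0.5), and cosmonaut vs. car is lower still (about 0.35).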

Evaluating Clustering Models
As with all machine learning algorithms, set the model parameters on the training set, then test performance on the test set, keeping the parameters constant.
- Training set: the set of instances used to generate the clusters (to compute the cluster centers, in the case of k-means).
- Test set: a set of unseen instances classified according to the clusters generated during training.

Problem: how do we determine k, i.e., the number of clusters? (Many algorithms require a fixed k, not only k-means.)
Solution: treat k as a parameter, i.e., vary k and find the best performance on a given data set.

Evaluating Clustering Models
Problem: how do we evaluate the performance on the test set? How do we know if the clusters are correct? Possible solutions:
- Test the resulting clusters intuitively, i.e., inspect them and see if they make sense. Not advisable.
- Have an expert generate clusters manually, and test the automatically generated ones against them.
- Test the clusters against a predefined classification, if one exists.
- Perform task-based evaluation, i.e., test if the performance of some other algorithm can be improved by using the output of the clusterer.

Evaluating Clustering Models
Example: use a clustering algorithm to discover parts of speech in a set of words. The algorithm should group together words with the same syntactic category.
- Intuitive: check if the words in the same cluster seem to have the same part of speech.
- Expert: ask a linguist to group the words in the data set according to POS, and compare the result to the automatically generated clusters.
- Classification: look up the words in a dictionary, and test if words in the same cluster have the same POS in the dictionary.
- Task: use the clusters for parsing (for example) and test if the performance of the parser improves.
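Testing clusters against a predefined classification can be made concrete with cluster purity, a standard evaluation measure (not one named on the slide): the fraction of items whose cluster's majority gold label matches their own. The words and POS labels below are invented for illustration:

```python
from collections import Counter

def purity(clusters, gold):
    """Purity of a hard clustering against a gold classification.
    `clusters` is a list of lists of items; `gold` maps item -> label."""
    total = sum(len(cluster) for cluster in clusters)
    correct = 0
    for cluster in clusters:
        labels = Counter(gold[item] for item in cluster)
        correct += labels.most_common(1)[0][1]   # size of the majority label
    return correct / total

# Hypothetical example: a dictionary lookup as the gold POS classification.
gold = {"dog": "N", "cat": "N", "run": "V", "walk": "V", "table": "N"}
```

Here `purity([["dog", "cat", "run"], ["walk", "table"]], gold)` is (2 + 1) / 5 = 0.6, while a clustering that matches the dictionary exactly scores 1.0.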

Summary
- Clustering algorithms discover groups of similar instances instead of requiring a predefined classification; they are unsupervised.
- Depending on the application, hierarchical or flat, and hard or soft clustering is appropriate.
- The k-means algorithm assigns instances to clusters according to their Euclidean distance to the cluster centers, then recomputes each cluster center as the mean of the instances in its cluster.
- Clusters can be evaluated against an external classification (expert-generated or predefined) or in a task-based fashion.

References
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Morgan Kaufmann.