CPSC 340: Machine Learning and Data Mining. K-Means Clustering Fall 2015

Similar documents
Social Media Mining. Data Mining Essentials

Using multiple models: Bagging, Boosting, Ensembles, Forests

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Cluster Analysis: Basic Concepts and Algorithms

How To Cluster

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

K-Means Clustering Tutorial

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Unsupervised learning: Clustering

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Comparison of K-means and Backpropagation Data Mining Algorithms

MS1b Statistical Data Mining

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Distances, Clustering, and Classification. Heatmaps

Introduction to Machine Learning Using Python. Vikram Kamath

Analytics on Big Data

Machine Learning using MapReduce

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Dynamical Clustering of Personalized Web Search Results

Using Data Mining for Mobile Communication Clustering and Characterization

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Cluster Analysis: Basic Concepts and Algorithms

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Supervised and unsupervised learning - 1

How To Perform An Ensemble Analysis

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

Neural Networks Lesson 5 - Cluster Analysis

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Monday Morning Data Mining

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Environmental Remote Sensing GEOG 2021

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering UE 141 Spring 2013

Clustering Connectionist and Statistical Language Processing

The Scientific Data Mining Process

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Data Mining. Nonlinear Classification

Information Retrieval and Web Search Engines

Applying Data Analysis to Big Data Benchmarks. Jazmine Olinger

L25: Ensemble learning

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Local outlier detection in data forensics: data mining approach to flag unusual schools

6.2.8 Neural networks for data mining

MACHINE LEARNING IN HIGH ENERGY PHYSICS

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Leveraging Ensemble Models in SAS Enterprise Miner

Chapter 7. Cluster Analysis

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Colour Image Segmentation Technique for Screen Printing

: Introduction to Machine Learning Dr. Rita Osadchy

Specific Usage of Visual Data Analysis Techniques

Lecture 6 Online and streaming algorithms for clustering

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Cluster Analysis: Basic Concepts and Methods

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Support Vector Machines with Clustering for Training with Very Large Datasets

Novelty Detection in image recognition using IRF Neural Networks properties

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Data Mining - Evaluation of Classifiers

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Maschinelles Lernen mit MATLAB

The Delicate Art of Flower Classification

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Azure Machine Learning, SQL Data Mining and R

Self Organizing Maps: Fundamentals

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Chapter ML:XI (continued)

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Fast Matching of Binary Features

How To Solve The Cluster Algorithm

Segmentation & Clustering

Segmentation of stock trading customers according to potential value

Data Mining Part 5. Prediction

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Deep profiling of multitube flow cytometry data Supplemental information

Calculation of Minimum Distances. Minimum Distance to Means. Σi i = 1

Introduction to Clustering

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Cluster analysis with SPSS: K-Means Cluster Analysis

Data Mining Algorithms Part 1. Dejan Sarka

Statistical Databases and Registers with some datamining

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Knowledge Discovery and Data Mining

R software tutorial: Random Forest Clustering Applied to Renal Cell Carcinoma Steve Horvath and Tao Shi

Machine Learning for Data Science (CS4786) Lecture 1

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Transcription:

CPSC 340: Machine Learning and Data Mining K-Means Clustering Fall 2015

Admin Assignment 1 solutions posted after class. Tutorials for Assignment 2 on Monday.

Random Forests Random forests are one of the best out of the box classifiers. Fit deep decision trees to random bootstrap samples of data, base splits on random subsets of the features, and classify using mode.

Random Forests Random forests are one of the best out of the box classifiers. Fit deep decision trees to random bootstrap samples of data, base splits on random subsets of the features, and classify using mode.

Classifying Cancer Types I collected gene expression data for 1000 different types of cancer cells, can you tell me the different classes of cancer? X = We are not given the class labels y, but want meaningful labels. An example of unsupervised learning. https://corelifesciences.com/human-long-non-coding-rna-expression-microarray-service.html

Unsupervised Learning Supervised learning: We have features x i and class labels y i. Write a program that produces y i from x i. Unsupervised learning: We only have x i values, but no explicit target labels. You want to do something with them. Some unsupervised learning tasks: Outlier detection: Is this a normal x i? Data visualization: What does the high-dimensional X look like? Association rules: Which x ij occur together? Latent-factors: What parts are the x i made from? Ranking: Which are the most important x i? Clustering: What types of x i are there?

Clustering: Clustering Input: set of objects described by features x i. Output: an assignment of objects to groups. Unlike classification, we are not given the groups. Algorithm must discover groups. Example of groups we might discover in e-mail spam: Lucky winner group. Weight loss group. Nigerian prince group.

Clustering Example

Clustering Example

Data Clustering General goal of clustering algorithms: Objects in the same group should be similar. Objects in different groups should be different. But the best clustering is hard to define: We don t have a test error. Generally, there is no best method in unsupervised learning. Means there are lots of methods: we ll focus on important/representative ones. Why cluster? You could want to know what the groups are. You could want a prototype example for each group. You could want to find the group for a new example x. You could want to find objects related to a new example x.

http://jvi.asm.org/content/86/20/11096.abstract Clustering of Epstein-Barr Virus

http://www.eecs.wsu.edu/~cook/dm/lectures/l9/index.html http://www.biology-online.org/articles/canine_genomics_genetics_running/figures.html http://wannabite.com/wp-content/uploads/2014/10/ragu-pasta-sauce-printable-coupon.jpg Other Clustering Applications NASA: what types of stars are there? Biology: are there sub-species? Documents: what kinds of documents are on my HD? Commercial: what kinds of customers do I have? Clothing: what sizes of clothing should I make?

K-Means Most popular clustering method is k-means. Input: The number of clusters k. Initial guesses of the mean of each cluster. Algorithm: Assign each x i to its closest mean. Update the means based on the assignment. Repeat until convergence.

K-Means Example Start with k initial means (usually, random data points)

K-Means Example Assign each object to the closest mean.

K-Means Example Update the mean of each group.

K-Means Example Assign each object to the closest mean.

K-Means Example Update the mean of each group.

K-Means Example Assign each object to the closest mean.

K-Means Example Update the mean of each group.

K-Means Example Assign each object to the closest mean.

K-Means Example Update the mean of each group. Stop if no objects change groups.

Cost of K-means The bottleneck is calculating distance from x i to mean c: Each time we do this costs O(d) to go through all features. For each of the n objects, we compute the distance to k clusters. Total cost of assigning objects to clusters is O(ndk). Fast if k is not too large. Updating means is cheaper: O(nd).

K-Means Issues Guaranteed to converge when using Euclidean distance. Clustering a new object: Assign to the nearest mean. Assumes you know k. Each object is assigned to one (and only one) cluster: No possibility to leave objects unassigned. It may converge to sub-optimal local solution

K-Means Clustering with Different Initialization

K-Means Initialization Classic approach to dealing with sensitivity to initialization: Try several different random starting points, choose the best. Newer approach: K-Means++ Choose a random data point as the first mean. Compute the distance of every point to the closets mean. Sample the next proportional to these distances squared. K-Means++ tends to give means that are far apart. Can prove it yields an approximation to optimal K-means clustering.

Vector Quantization K-means originally comes from signal processing. Designed for vector quantization: Replace vectors (objects) with a set of prototypes (means). Example: Facebook places:

Vector Quantization: Image Colors Usual RGB representation of a pixel s color: three 8-bit numbers. For example, [241 13 50] =. Can apply k-means to find set of prototype colours. Original: (24-bits/pixel) K-Means Quantized: (6-bits/pixel)

Vector Quantization: Image Colors Usual RGB representation of a pixel s color: three 8-bit numbers. For example, [241 13 50] =. Can apply k-means to find set of prototype colours. Original: (24-bits/pixel) K-Means Quantized: (3-bits/pixel)

Vector Quantization: Image Colors Usual RGB representation of a pixel s color: three 8-bit numbers. For example, [241 13 50] =. Can apply k-means to find set of prototype colours. Original: (24-bits/pixel) K-Means Quantized: (2-bits/pixel)

Vector Quantization: Image Colors Usual RGB representation of a pixel s color: three 8-bit numbers. For example, [241 13 50] =. Can apply k-means to find set of prototype colours. Original: (24-bits/pixel) K-Means Quantized: (1-bits/pixel)

What is K-Means Doing? We can interpret K-Means as trying to minimize an objective: Sum of distances from each object xi to its center: We alternate between: Updating cluster assignments c(i). Updating means µ c. Convergence follows because Each step does not increase the objective. There are a finite number of assignments to k clusters.

K-Medoids With other distances, k-means may not converge. However, changing objective function gives convergent algorithms. E.g., we can use the L1-norm: A k-medoids algorithm based on the L1-norm optimizes: Cluster assignment based on the L1-norm. Update medoids by setting them to the median. This approach is more robust to outliers.

Summary Unsupervised learning: fitting data without explicit labels. Clustering: finding groups of related objects. K-means: simple iterative clustering strategy. Vector quantization: replacing measurements with prototypes. K-medoids: generalization to other distance functions. Next time: Non-parametric clustering.