ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)



Gabriela Ochoa
http://www.cs.stir.ac.uk/~goc/

OUTLINE
Preliminaries
  Classification and Clustering
  Applications of clustering
Review of relevant concepts
  Euclidean distance
  Voronoi diagrams
K-Means algorithm
  Description and Demo
  K-Means as an optimisation problem
  A real-world example
Summary & what's next?

PRELIMINARIES
Machine learning helps us navigate and process large volumes of data.
Examples of questions about our data:
  What is this data point most similar to?
  Does the data come in patterns?
  Can we predict what will happen in the future, given past trends?

MACHINE LEARNING TASKS
Classification (supervised learning)
  Make a prediction given evidence
  We studied Decision Trees, but there are many other methods
  Useful when you have labelled data
Clustering (unsupervised learning)
  Detect patterns in unlabelled data. Examples:
    group emails or search results
    find categories of customers
    detect anomalous program executions
  Useful when you don't know what you're looking for
  Requires data, but no labels

APPLICATIONS OF CLUSTERING
Economy (market research)
  Discover distinct groups in a customer base and use this knowledge to develop targeted marketing programs
Internet and WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns
Pattern recognition
  Blind signal separation: imagine a recording of two voices made with a microphone. The task is to separate the two voices into separate signals
Image processing
  Astronomy: aggregation of stars, galaxies, or supergalaxies
  Medicine: separating healthy from diseased tissue

CLUSTERING
Basic idea: group together similar instances
Example: 2D point patterns
What could "similar" mean?
  One option: small (squared) Euclidean distance

DISTANCE BETWEEN POINTS
The distance between two points is the length of the straight path connecting them.
In the plane, the distance between points (x_1, y_1) and (x_2, y_2) is given by Pythagoras' theorem:
  $$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
In Euclidean three-space, the distance between (x_1, y_1, z_1) and (x_2, y_2, z_2) is:
  $$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
In general, for n dimensions, the distance between x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is:
  $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

VORONOI DIAGRAM
The partitioning of a plane with n points into convex polygons such that:
  Each polygon contains exactly one generating point
  Every point in a given polygon is closer to its generating point than to any other
Image from Wikipedia
Weisstein, Eric W. "Voronoi Diagram." From MathWorld. http://mathworld.wolfram.com/voronoidiagram.html
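
The two concepts above can be tried out in a few lines of Python. This is only a minimal sketch: the function names `euclidean_distance` and `nearest_generator` and the sample coordinates are illustrative choices, not part of the lecture. The second function answers the Voronoi question "which polygon does this point fall in?" by finding the closest generating point.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_generator(point, generators):
    """Index of the generating point closest to `point`.

    This is the Voronoi polygon that `point` belongs to
    (ties go to the generator with the lowest index).
    """
    return min(range(len(generators)),
               key=lambda i: euclidean_distance(point, generators[i]))

# Hypothetical sample data for illustration only.
print(euclidean_distance((0, 0), (3, 4)))        # 2D: a 3-4-5 triangle
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 3D

generators = [(0, 0), (10, 0), (0, 10)]
print(nearest_generator((2, 1), generators))
print(nearest_generator((8, 3), generators))
```

The same `euclidean_distance` works for any number of dimensions because it simply sums the squared coordinate differences.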

K-MEANS
Demo: Javascript K-Means Demo
An iterative clustering algorithm:
1. Plot the data points
2. Create K additional points, placing them randomly. These points are the cluster centroids
3. Repeat:
   Assign each data point to the cluster centroid closest to it
   Move each centroid to the average position of all the data points that belong to it
   If any of the centroids moved, repeat; otherwise exit

K-MEANS EXAMPLE
[Figure: K-means example]
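
The steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the code behind the demo: the names `sq_dist` and `k_means`, the `max_iter` safety limit, and the sample points are all assumptions. The initial centroids are passed in by the caller, so they can be placed randomly (as in step 2) or chosen by hand. Squared Euclidean distance is used for the comparison, which picks the same closest centroid as the plain distance.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid that points[i] is assigned to.
    """
    centroids = list(centroids)
    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # Assign each data point to the closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        # Move each centroid to the average position of its points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption: a centroid with no points stays where it is.
                new_centroids.append(centroids[k])
        # If no centroid moved, exit; otherwise repeat.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two small groups of 2D points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

# Hand-picked start, one centroid in each group.
centroids, labels = k_means(points, [(1, 1), (8, 8)])
print(centroids, labels)

# Random start: here K = 2 of the data points are used as initial centroids.
start = random.Random(0).sample(points, 2)
print(k_means(points, start))
```

Passing the starting centroids explicitly makes it easy to repeat a run, or to compare how different starting positions affect the result.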

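K-MEANS AS OPTIMISATION
Consider the total distance to the means:
  $$\phi(\{x_i\}, \{a_i\}, \{c_k\}) = \sum_i \mathrm{dist}(x_i, c_{a_i})$$
where {x_i} are the points, {a_i} the assignments and {c_k} the means.
Each iteration reduces phi (or leaves it unchanged once the algorithm has converged)
Two stages each iteration:
  Update assignments: fix means c, change assignments a
  Update means: fix assignments a, change means c

PHASE I: UPDATE ASSIGNMENTS
For each point, reassign to closest mean:
  $$a_i = \arg\min_k \mathrm{dist}(x_i, c_k)$$
Can only decrease total distance phi!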
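
To see the optimisation view in action, the sketch below records phi after every stage of a few iterations. It assumes squared Euclidean distance as the distance measure, and the helper names (`phi`, `update_assignments`, `update_means`), the sample points and the deliberately poor starting assignments are all illustrative choices rather than part of the lecture.

```python
def sq_dist(p, q):
    """Squared Euclidean distance (the distance measure assumed here)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def phi(points, assignments, means):
    """Total distance from every point to the mean it is assigned to."""
    return sum(sq_dist(x, means[a]) for x, a in zip(points, assignments))

def update_assignments(points, means):
    """Fix the means, reassign each point to its closest mean."""
    return [min(range(len(means)), key=lambda k: sq_dist(x, means[k]))
            for x in points]

def update_means(points, assignments, means):
    """Fix the assignments, move each mean to the average of its points."""
    new_means = []
    for k in range(len(means)):
        members = [x for x, a in zip(points, assignments) if a == k]
        if members:
            new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_means.append(means[k])  # assumption: empty clusters stay put
    return new_means

# Hypothetical data and a deliberately poor starting point.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
means = [(1, 1), (2, 2)]
assignments = [0, 1, 0, 1, 0, 1, 0, 1]

history = [phi(points, assignments, means)]
for _ in range(3):
    assignments = update_assignments(points, means)
    history.append(phi(points, assignments, means))
    means = update_means(points, assignments, means)
    history.append(phi(points, assignments, means))

print(history)  # the values never increase from one stage to the next
```

Each entry in `history` is less than or equal to the one before it, which is why the algorithm must eventually stop changing.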

PHASE II: UPDATE MEANS
Move each mean to the average of its assigned points:
  $$c_k = \frac{1}{|\{i : a_i = k\}|} \sum_{i : a_i = k} x_i$$
Also can only decrease total distance
Fun fact: the point y with minimum total squared Euclidean distance to a set of points {x} is their mean

INITIALISATION
K-means is nondeterministic
  Requires initial means
  It does matter what you pick!
  What can go wrong?
Various schemes for preventing this kind of thing:
  Multiple restarts
  Variance-based split / merge
  Initialisation heuristics
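
The effect of the initial means, and the "multiple restarts" remedy, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names, the sample points, the two hand-picked starting positions, and the use of total squared Euclidean distance as the score for comparing runs.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, means, max_iter=100):
    """Run K-means from the given initial means; return (means, assignments, phi)."""
    means = list(means)
    for _ in range(max_iter):
        assignments = [min(range(len(means)), key=lambda k: sq_dist(x, means[k]))
                       for x in points]
        new_means = []
        for k in range(len(means)):
            members = [x for x, a in zip(points, assignments) if a == k]
            new_means.append(tuple(sum(c) / len(members) for c in zip(*members))
                             if members else means[k])
        if new_means == means:
            break
        means = new_means
    total = sum(sq_dist(x, means[a]) for x, a in zip(points, assignments))
    return means, assignments, total

def best_of_restarts(points, initialisations):
    """Run K-means once per initialisation and keep the run with the lowest phi."""
    runs = [k_means(points, init) for init in initialisations]
    return min(runs, key=lambda run: run[2])

# Hypothetical data: two small groups of 2D points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]

good = k_means(points, [(1, 1), (8, 8)])   # one initial mean in each group
stuck = k_means(points, [(8, 9), (9, 8)])  # both initial means in one group
print("phi from the first start: ", good[2])
print("phi from the second start:", stuck[2])

# Multiple restarts: try several random starts and keep the best run.
rng = random.Random(0)
starts = [rng.sample(points, 2) for _ in range(10)]
best = best_of_restarts(points, starts)
print("best of 10 random starts:", best[0], best[2])
```

On this data the second start converges to a much larger phi than the first, even though both runs stop because no mean moves any more. Keeping the lowest-phi run over several starts reduces the chance of ending with such a result, but does not guarantee the best possible clustering.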

K-MEANS GETTING STUCK
A local optimum: [figure]
Why doesn't this work out like the earlier example, with the purple taking over half the blue?

K-Means has some drawbacks
  Several methods have been proposed to overcome them
  Still very much used in practice
  Main limitation: the number of clusters K must be specified in advance

EXAMPLE: GOOGLE NEWS
Top-level categories: supervised classification
Story groupings: unsupervised clustering

SUMMARY: MAIN MACHINE LEARNING TASKS
Supervised learning
  Inferring a function from labelled training data
  The training data consist of a set of training examples
  Each example is a pair consisting of an input vector and a desired output value
  Algorithms: neural networks, Bayesian methods, kernel estimators, nearest neighbour, etc.
Unsupervised learning
  Trying to find hidden structure in unlabelled data
  Requires data, but no labels
  Examples are unlabelled, so there is no error or reward signal to evaluate a potential solution
  Algorithms:
    Clustering (k-means, mixture models, hierarchical clustering)
    Expectation-maximisation algorithm
    PCA, self-organised maps, etc.