Exploratory data analysis approaches: unsupervised approaches. Steven Kiddle. With thanks to Richard Dobson and Emanuele de Rinaldis.




Lecture overview
- Background
- Revision
- Other clustering methods

Background

Motivation
- The most challenging task for a scientist is to make sense of lots of data.
- The power of high-throughput analysis does not come from the analysis of single genes but from the analysis of many data points to identify patterns of gene expression.
- Unsupervised learning allows unexpected patterns to be spotted.

Supervised vs. unsupervised learning
- Supervised: the inputs are observations and the outputs are observations too.
- Unsupervised: the inputs are observations, but the outputs are unobserved, or not used in the initial analysis.

Clustering
Finding a partition such that:
- Distance between objects within the same cluster is minimised
- Distance between objects from different clusters is maximised
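As a minimal sketch of these two criteria (not part of the original slides), the Python below compares the mean pairwise distance within clusters and between clusters for two candidate partitions. The function name, the toy points, and the choice of Euclidean distance are illustrative assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_within_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes the partition has at least one pair of points inside a cluster
    and at least one pair split across clusters.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical toy data: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_within_between(points, [0, 0, 1, 1])  # groups kept together
bad = mean_within_between(points, [0, 1, 0, 1])   # groups split up
```

A partition that respects the two groups has the smaller within-cluster distance and the larger between-cluster distance, which is the behaviour the slide asks for.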

Applications
- Biology: finding similar organisms, sequences, molecular signatures
- Marketing: identifying groups of customers with similar preferences
- Earthquakes: predicting the epicentre based on recordings
- Images: image compression
- Many more

Revision

Hierarchical clustering revision
In R: hclust
In MATLAB: clusterdata
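For readers without R or MATLAB to hand, the following is a rough Python sketch of agglomerative (bottom-up) hierarchical clustering. It is a sketch under stated assumptions, not the implementation behind hclust or clusterdata: the function name is hypothetical, it uses Euclidean distance with complete linkage (the default method in R's hclust), and it returns a single flat cut rather than a full dendrogram.

```python
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two nearby pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
three = agglomerative(points, 3)
two = agglomerative(points, 2)
```

Cutting at three clusters keeps the two pairs and the distant point apart; cutting at two merges the pairs first, because they are closer to each other than either is to the distant point.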


Principal Components Analysis (PCA) revision
http://gettinggeneticsdone.blogspot.co.uk/
In R: prcomp
In MATLAB: princomp
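As a reminder of what PCA computes, here is a small Python sketch that finds only the first principal component, using power iteration on the sample covariance matrix. This is an illustrative assumption, not how prcomp or princomp work internally; the function name, starting vector, and toy data are hypothetical. The data are centred first, as prcomp does by default.

```python
def first_principal_component(rows, iterations=200):
    """Return the unit vector along the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting vector (assumption)
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical toy data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
pc1 = first_principal_component(rows)
```

For these points the first component points almost exactly along the direction (1, 2), so projecting onto it keeps nearly all of the variance in one dimension.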


Other clustering methods

Multidimensional scaling (MDS)
- PCA is a special case of MDS: classical MDS applied to Euclidean distances gives the same embedding as PCA.
- MDS can also use non-linear transformations of the data points.
(Aflalo et al., 2013)
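To make the link between MDS and PCA concrete, the sketch below implements classical MDS in one dimension in Python. It starts from a matrix of pairwise distances only, double-centres the squared distances, and takes the leading eigenvector by power iteration. The function name, the starting vector, and the toy distances are assumptions for illustration; when the distances are Euclidean, the result matches the first principal-component scores up to sign.

```python
def classical_mds_1d(D, iterations=500):
    """Place items on a line so that their gaps approximate the distances in D."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (D is assumed symmetric)
    total = sum(row) / n            # grand mean
    # Double centring turns squared distances into inner products.
    B = [
        [-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector (assumption)
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    # Coordinates are the eigenvector scaled by the root of its eigenvalue.
    return [x * eigenvalue ** 0.5 for x in v]


# Hypothetical toy data: distances between four items that sit on a line.
positions = [0.0, 1.0, 3.0, 7.0]
D = [[abs(a - b) for b in positions] for a in positions]
coords = classical_mds_1d(D)
```

The recovered coordinates are a shifted (and possibly mirrored) copy of the original positions, so every pairwise distance is reproduced even though only the distance matrix was supplied.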

Automatic identification of clusters
Standard approaches:
- K-means
- K-centre

K-means example
Figure from Li et al. (2010)

K-means algorithm
In R: kmeans
In MATLAB: kmeans
1) k initial "means" (k = 3 in the illustrated example) are randomly generated within the data domain.
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.
Illustrations: Wikimedia
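The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation behind R's or MATLAB's kmeans. The function name, the optional initial_means argument (added here only so that the example is reproducible), and the rule that an empty cluster keeps its previous mean are all assumptions.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, seed=0, initial_means=None, max_iter=100):
    """Return (labels, means) after running the four steps of k-means."""
    rng = random.Random(seed)
    d = len(points[0])
    if initial_means is None:
        # Step 1: k random initial means inside the bounding box of the data.
        lo = [min(p[a] for p in points) for a in range(d)]
        hi = [max(p[a] for p in points) for a in range(d)]
        means = [
            tuple(rng.uniform(lo[a], hi[a]) for a in range(d)) for _ in range(k)
        ]
    else:
        means = [tuple(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Step 2: associate every observation with its nearest mean.
        new_labels = [
            min(range(k), key=lambda c: dist(p, means[c])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: the centroid of each cluster becomes the new mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mean (assumption)
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, means


# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, means = kmeans(points, 2, initial_means=[(1.0, 1.0), (9.0, 9.0)])
```

With random initial means the result can depend on the seed, which is why practical implementations usually offer several random restarts.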

K-means disadvantages
- Assumes clusters have the same variance as each other in all directions. Alternative: Expectation-Maximisation (EM) clustering.
- Requires a distance measure with a defined mean, such as Euclidean distance (the ordinary, Pythagorean distance). Alternative: K-centres.
Illustration: Wikimedia

Unusual distance measures
Example: flight time, which can be affected by wind, busy airports, etc.
Illustration: Wikimedia

K-centres (also called K-medoids)
Same data points as we used for k-means.
In R: pam {cluster}
In MATLAB: kcenters
http://www.psi.toronto.edu/index.php?q=affinity%20propagation
1) k data points are selected at random as the initial centres.
2) k clusters are created by associating every observation with the nearest centre/medoid.
3) For each cluster, the observation with the smallest total distance to the other members of the cluster is chosen as the new centre/medoid.
4) Steps 2 and 3 are repeated until convergence has been reached.
Illustrations: Wikimedia
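The four steps above can be sketched in Python using nothing but a table of pairwise distances, which is what makes the method usable with unusual measures such as flight time. This is a simple alternating scheme written for illustration; it is not the procedure used by R's pam, which has separate build and swap phases. The function name, the optional initial argument (added for reproducibility), and the travel times are hypothetical, and the sketch assumes the chosen centres are distinct points.

```python
import random


def k_centres(n, distance, k, seed=0, initial=None, max_iter=100):
    """Cluster items 0..n-1 given only a pairwise distance function."""
    rng = random.Random(seed)
    # Step 1: k data points selected at random (or supplied by the caller).
    centres = list(initial) if initial is not None else rng.sample(range(n), k)
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest centre.
        labels = [
            min(range(k), key=lambda c: distance(i, centres[c])) for i in range(n)
        ]
        # Step 3: in each cluster, the member with the smallest total distance
        # to the other members becomes the new centre.
        new_centres = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            new_centres.append(
                min(members, key=lambda m: sum(distance(m, j) for j in members))
            )
        # Step 4: repeat until the centres stop changing.
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres


# Hypothetical travel times (hours) between five airports.
times = [
    [0, 1, 2, 9, 8],
    [1, 0, 1, 8, 9],
    [2, 1, 0, 9, 9],
    [9, 8, 9, 0, 1],
    [8, 9, 9, 1, 0],
]
labels, centres = k_centres(5, lambda i, j: times[i][j], 2, initial=[0, 3])
```

No mean is ever computed: each centre is always one of the observations, so any distance table can be used.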

Example of results
Illustration: Wikimedia

Even more clustering methods
K-means style:
- Expectation Maximisation
- Self Organising Maps
- Neural Networks
K-centres style:
- Affinity Propagation (Frey et al., 2007)
- Simulated Annealing
For time series:
- SplineCluster (Heard et al., 2006)
Illustration: Wikimedia