Introduction to Data Mining and Predictive Analytics!




Introduction to Data Mining and Predictive Analytics! Paul Rodriguez, Ph.D., Predictive Analytics Center of Excellence, San Diego Supercomputer Center, University of California, San Diego

PACE: Closing the Gap between Government, Industry and Academia!
PACE is a non-profit, public educational organization:
- To promote, educate and innovate in the area of predictive analytics
- To leverage predictive analytics to improve the education and well-being of the global population and economy
- To develop and promote a new, multi-level curriculum to broaden participation in the field of predictive analytics
Activities: bridge the industry and academia gap; provide predictive analytics services; maintain a data mining repository of very large data sets; inform, educate and train; foster research and collaboration; develop standards and methodology; support high-performance, scalable data mining.

Introduction: What is Data Mining?!

Necessity is the Mother of Invention!
Data explosion: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
"We are drowning in data, but starving for knowledge." (John Naisbitt, 1982)

Necessity is the Mother of Invention!
Data explosion: "...explosion of heterogeneous, multi-disciplinary, Earth Science data has rendered traditional means of integration and analysis ineffective, necessitating the application of new analysis methods for synthesis, comparison, visualization" (Hoffman et al., Data Mining in Earth System Science, 2011)

What does Data Mining Do? Explores Your Data! Finds Patterns! Performs Predictions!

Data Mining Perspective! Top-down (hypothesis-driven, analytical tools) vs. bottom-up (data-driven). Corporate data at increasing depth:
- Surface: SQL tools for simple queries and reporting
- Shallow: statistical & OLAP tools for summaries and analysis
- Hidden: data mining methods for knowledge discovery

Query & Reporting vs. OLAP vs. Data Mining:
- Query & Reporting: extraction of data; yields information; e.g., "Who purchased the product in the last 2 quarters?"
- OLAP: descriptive summaries; yields analysis; e.g., "What is the average income of the buyers per quarter by district?"
- Data Mining: discovery of hidden patterns and information; yields insight, knowledge and prediction; e.g., "What is the relationship between customers and their purchases?"

What Is Data Mining in practice?!
A combination of AI and statistical analysis to discover information that is hidden in the data:
- associations (e.g., linking purchase of pizza with beer)
- sequences (e.g., tying events together: marriage and purchase of furniture)
- classifications (e.g., recognizing patterns such as the attributes of employees that are most likely to quit)
- forecasting (e.g., predicting buying habits of customers based on past patterns)

Multidisciplinary Field! Data mining draws on database technology, statistics, machine learning, visualization, artificial intelligence, and other disciplines.

Taxonomy!
- Predictive methods: use some variables to predict unknown or future values of other variables
- Descriptive methods: find human-interpretable patterns that describe the data
- Supervised (outcomes or labels provided) vs. unsupervised (just data)

What can we do with Data Mining?! Exploratory Data Analysis! Cluster analysis/segmentation! Predictive Modeling: Classification and Regression! Discovering Patterns and Rules! Association/Dependency rules! Sequential patterns! Temporal sequences! Anomaly detection!

Data Mining Applications!
- Science (chemistry, physics, medicine, energy): biochemical analysis, remote sensors on a satellite, telescope star/galaxy classification, medical image analysis
- Bioscience: sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression
- Pharmaceutical companies, insurance and health care, medicine: drug development, identifying successful medical therapies, claims analysis, fraudulent behavior, medical diagnostic tools, predicting office visits
- Financial industry, banks, businesses, e-commerce: stock and investment analysis, identifying loyal vs. risky customers, predicting customer spending, risk management, sales forecasting

Data Mining Tasks!
- Concept/class description: characterization and discrimination; generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions, normal vs. fraudulent behavior
- Association (correlation and causality): multi-dimensional interactions and associations, e.g.,
  age(x, "20-29") AND income(x, "60-90K") => buys(x, "TV")
  hospital(area code) AND procedure(x) => claim(type) AND claim(cost)

Data Mining Tasks!
- Classification and prediction: finding models (functions) that describe and distinguish classes or concepts for future prediction
- Examples: classify countries based on climate, classify cars based on gas mileage, detect fraud based on claims information, predict energy usage based on sensor data
- Presentation: IF-THEN rules, decision trees, classification rules, neural networks
- Prediction: predict unknown or missing numerical values

Data Mining Tasks! Cluster analysis! Class label is unknown: Group data to form new classes! Example: cluster claims or providers to find distribution patterns of unusual behavior! Clustering based on the principle: maximizing the intraclass similarity and minimizing the interclass similarity!

Data Mining Tasks!
- Outlier analysis: an outlier is a data object that does not comply with the general behavior of the data; mostly treated as noise or exception, but quite useful in fraud detection and rare-events analysis
- Trend and evolution analysis: trend and deviation (regression analysis); sequential pattern mining, periodicity analysis

KDD Process! (flow diagram) Database -> Selection -> Transformation / Data Preparation -> Training Data -> Data Mining -> Model, Patterns -> Evaluation, Verification

Steps of a KDD Process (1)!
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering

Steps of a KDD Process (2)!
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: the search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use and integration of the discovered knowledge

Learning and Modeling Methods!
- Decision tree induction (C4.5, J48)
- Regression tree induction (CART, M5P)
- Multivariate adaptive regression splines (MARS)
- Clustering (K-means, EM, Cobweb)
- Artificial neural networks (backpropagation, recurrent)
- Support vector machines (SVM)
- Various other models

Decision Tree Induction! Method for approximating discrete-valued functions! robust to noisy/missing data! can learn non-linear relationships! inductive bias towards shorter trees!

Decision Tree Induction! Applications: medical diagnosis (e.g., heart disease), analysis of complex chemical compounds, classifying equipment malfunction, risk of loan applicants, Boston housing price prediction, fraud detection

Decision Tree Example!
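The example figure is not reproduced in this transcript; as a stand-in, here is a minimal sketch in R of inducing and inspecting a decision tree, assuming the rpart package (a CART-style learner rather than C4.5/J48) and the built-in iris data:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # grow a classification tree
print(fit)                                                 # text form of the induced rules
plot(fit); text(fit)                                       # quick plot of the tree structure
preds <- predict(fit, iris, type = "class")                # re-substitution predictions
table(preds, iris$Species)                                 # confusion table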

Regression Tree Induction! Why regression trees? Ability to: predict continuous variables, model conditional effects, model uncertainty

Regression Trees! Continuous goal variables! Induction by means of an efficient recursive partitioning algorithm! Uses linear regression to select internal nodes! (Quinlan, 1992)

Clustering! Basic idea: group similar things together. Unsupervised learning: useful when no other info is available. K-means: partitioning instances into k disjoint clusters using a measure of similarity.

Clustering! (figure: scatter plot of example clusters)

Kmeans Results! From 45,000 NYTimes articles; cluster means shown, with coordinates determining font size; 7 viable clusters found.

Artificial Neural Networks (ANNs)! A network of many simple units. Main components: inputs, hidden layers, outputs. Learning by adjusting the weights of connections (backpropagation).

Evaluation!
- Error on the training data vs. performance on future/unseen data
- Simple solution: split data into a training and a test set
- Re-substitution error: the error rate obtained on the training data
- Three sets: training data, validation data, and test data

Training and Testing!
- Test set: a set of independent instances that have not been used in the formation of the classifier in any way
- Assumption: the data contains representative samples of the underlying problem
- Example: classifiers built using customer data from two different towns A and B. To estimate the performance of a classifier built on town A in a completely new town, test it on data from town B.

Error Estimation Methods!
- Cross-validation: partition the data into k disjoint subsets (folds); train on k-1 folds, test on the remaining one (see the sketch below)
- Leave-one-out method
- Bootstrap: sampling with replacement
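A minimal k-fold cross-validation sketch in base R, assuming the rpart tree learner from the earlier sketch and the built-in iris data (both choices are illustrative, not from the slides):

library(rpart)
set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold label per instance
errs  <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]                        # train on k-1 folds
  test  <- iris[folds == i, ]                        # test on the held-out fold
  fit   <- rpart(Species ~ ., data = train, method = "class")
  mean(predict(fit, test, type = "class") != test$Species)
})
mean(errs)                                           # cross-validated error estimate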

Data Mining Challenges! Computationally expensive to investigate all possibilities! Dealing with noise/missing information and errors in data! Mining methodology and user interaction! Mining different kinds of knowledge in databases! Incorporation of background knowledge! Handling noise and incomplete data! Pattern evaluation: the interestingness problem! Expression and visualization of data mining results!

Data Mining Heuristics and Guide! Choosing appropriate attributes/input representation! Finding the minimal attribute space! Finding adequate evaluation function(s)! Extracting meaningful information! Not overfitting! 35

Model Recommendations! Usually no absolute choice and no silver bullets (otherwise we wouldn't be here)! Start with simple methods! Consider trade-offs as you go more complex! Find similar application examples (what works in this domain)! Find paradigmatic examples for models (what works for this model)! Set goals and expectations!

Available Data Mining Tools!
COTS: IBM Intelligent Miner, SAS Enterprise Miner, Oracle ODM, MicroStrategy, Microsoft DBMiner, Pentaho, Matlab, Teradata
Open source: WEKA, KNIME, Orange, RapidMiner, NLTK, R, Rattle

On the Importance of Data Prep! Garbage in, garbage out! A crucial step of the DM process! Could take 60-80% of the whole data mining effort!

Preprocessing Input/Output! Inputs: raw data. Outputs: two data sets, training and test (if available); training is further broken into training and validation.

Variables and Features (terms)!
- Variables and their transformations are features
- Instance labels are outcomes or dependent variables
- The set of instances comprises the input data matrix, often represented as a single flat file
- Data matrix size is often referred to by N observations and P variables
- Large N, small P => usually solvable
- Small N, large P => not always solvable directly; needs heuristics

Types of Measurements!
- Qualitative (unordered, non-scalar): nominal (names), categorical (zip codes)
- Quantitative (ordered, scalar): ordinal (H, M, L), real numbers
- Real numbers may or may not have a natural zero point; if not, comparisons are OK but not multiplication (e.g., dates)

Know Variable Properties!
- Variables as objects of study: explore the characteristics of each variable (typical values, min, max, range, etc.); entirely empty or constant variables can be discarded; explore variable dependencies
- Sparsity: missing, N/A, or 0?
- Monotonicity: increasing without bound (e.g., dates, invoice numbers); new values not in the training set

Data Integration Issues!
When combining data from multiple sources:
- Integrate metadata (table column properties); the same attribute may have different names in different databases, e.g., A.patient-id vs. B.patient-number. Using edit distance can help resolve entities (see the sketch below).
- Detect and resolve data value conflicts, e.g., attribute values may have different scales or nominal labels
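As a small illustration of the edit-distance idea, base R's adist() computes the Levenshtein distance between strings (the attribute names here are hypothetical):

adist("patient-id", "patient-number")   # small distance => candidate for the same entity
adist("patient-id", "invoice-total")    # large distance => unlikely match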

Noise in Data!
- Noise is an unknown error source, often assumed to follow a random normal distribution and to be independent between instances
- Approaches to address noise: detect suspicious values and remove outliers; smooth by averaging with neighbors; smooth by fitting the data to regression functions

Missing Data! Important to review statistics of a missing variable! Are missing cases random?! Are missing cases related to some other variable?! Are other variables missing data in the same instances?! Is there a relation between missing cases and the outcome variable?! What is the frequency of missing cases w.r.t. all instances?!

Approaches to Handle Missing Data!
- Ignore the tuple: usually done when the class label is missing or there's plenty of data
- Use a global constant to fill in ("impute") the missing value (or a special "unknown" value for nominals)
- Use the attribute mean to impute the missing value
- Use the attribute mean of all samples belonging to the same class to impute the value
- Use an algorithm to impute the missing value
- Use separate models with/without values
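A minimal base-R sketch of the constant, mean, and class-conditional mean imputations above, on a toy vector (values and labels are made up for illustration):

x <- c(5.1, NA, 4.9, 6.0, NA, 5.5)                     # toy attribute with missing values
y <- c("a", "a", "b", "b", "a", "b")                   # toy class labels

x_const <- ifelse(is.na(x), -999, x)                   # fill with a global constant
x_mean  <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # fill with the attribute mean

mu      <- ave(x, y, FUN = function(v) mean(v, na.rm = TRUE))  # per-class attribute means
x_class <- ifelse(is.na(x), mu, x)                     # fill with the mean of the same class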

Variable Enhancements! Make analysis easy for the tool! If you know how to deduce a feature, do it yourself and don't make the tool find it out, to save time and reduce noise! Include relevant domain knowledge! Getting enough data: do the observed values cover the whole range of the data?!

Data Transformation: Normalizations!
- Mean center: x_new = x - mean(x)
- z-score: z = (x - mean(x)) / std(x)
- Scale to [0, 1]: x_new = (x - min(x)) / (max(x) - min(x))
- Log scaling: x_new = log(x)
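The same four normalizations as one-liners in base R (x is any numeric vector; log scaling assumes positive values):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)                 # toy values

x_center <- x - mean(x)                        # mean center
x_z      <- (x - mean(x)) / sd(x)              # z-score (scale(x) does the same)
x_01     <- (x - min(x)) / (max(x) - min(x))   # scale to [0, 1]
x_log    <- log(x)                             # log scaling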

Variable Transformation Summary! Smoothing: remove noise from the data! Aggregation: summarization! Introduce/relabel/categorize variable values! Normalization: scale to fall within a small, specified range! Attribute/feature construction!

Pause!

Data Exploration and Unsupervised Learning with Clustering! Paul F. Rodriguez, Ph.D.! San Diego Supercomputer Center! Predictive Analytics Center of Excellence!

Clustering Idea! Given a set of data, can we find a natural grouping? (figure: scatter plot of points on axes X1 and X2)

Why Clustering?! A good grouping implies some structure. In other words, given a good grouping, we can then: interpret and label clusters; identify important features; characterize new points by the closest cluster (or nearest neighbors); use the cluster assignments as a compression or summary of the data.

Clustering Objective!
Objective: find subsets that are similar within a cluster and dissimilar between clusters. Similarity is defined by distance measures:
- Euclidean distance
- Manhattan distance
- Mahalanobis distance (Euclidean with dimensions rescaled by variance)
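These measures can be computed with base R / stats on a small toy matrix; note that mahalanobis() returns the squared distance of each row from a center:

X <- matrix(c(1, 2,
              3, 4,
              6, 1), ncol = 2, byrow = TRUE)         # three points in two dimensions

dist(X, method = "euclidean")                        # pairwise Euclidean distances
dist(X, method = "manhattan")                        # pairwise Manhattan distances
mahalanobis(X, center = colMeans(X), cov = cov(X))   # squared Mahalanobis distance to the mean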

Kmeans Clustering!
A simple, effective, and standard method:
  start with K initial cluster centers
  loop:
    assign each data point to the nearest cluster center
    calculate the mean of each cluster for the new center
  stop when assignments don't change
Issues: How to choose K? How to choose initial centers? Will it always stop?
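Base R's kmeans() implements this loop; a minimal sketch on synthetic data (two Gaussian blobs, made up for illustration):

set.seed(2)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))       # two synthetic blobs

km <- kmeans(X, centers = 2, iter.max = 10, nstart = 5)  # nstart restarts from random centers
km$centers                                               # final cluster means
km$cluster                                               # assignment of each point
km$tot.withinss                                          # total within-cluster SSE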

Kmeans Example! For K=1, using Euclidean distance, where will the cluster center be? (figure: scatter plot on axes X1 and X2)

Kmeans Example! For K=1, the overall mean minimizes the Sum of Squared Errors (SSE), i.e., the summed squared Euclidean distance.

Kmeans Example! (figure: solutions for K=1, K=2, K=3, K=4) As K increases, individual points get their own cluster.

Choosing K for Kmeans! (figure: total within-cluster SSE for K=1 to 10) Not much improvement after K=2 (the "elbow").

Kmeans Example, more points! How many clusters should there be?

Choosing K for Kmeans! (figure: total within-cluster SSE for K=1 to 10) Smooth decrease after K=2, harder to choose. In general, a smoother decrease => less structure.

Kmeans Guidelines!
- Choosing K: look for an elbow in the total within-cluster SSE as K = 1..N, or cross-validate (hold out points, compare fit as K = 1..N); see the sketch below
- Choosing initial starting points: take K random data points, run Kmeans several times, take the best fit
- Stopping: may converge to sub-optimal clusters; may get stuck or converge slowly (point assignments bounce around); 10 iterations is often good
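A sketch of the elbow heuristic, reusing the synthetic X from the kmeans sketch above:

set.seed(3)
sse <- sapply(1:10, function(k)
  kmeans(X, centers = k, nstart = 5)$tot.withinss)   # total within-cluster SSE per K
plot(1:10, sse, type = "b", xlab = "K",
     ylab = "Total within-cluster SSE")              # look for the K where the curve flattens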

Kmeans Example, uniform data! (figure: solutions for K=1, K=2, K=3, K=4)

Choosing K, uniform data! (figure: total within-cluster SSE for K=1 to 10) Smooth decrease across K => less structure.

Kmeans Clustering Issues!
- Deviations: dimensions with large numbers may dominate distance metrics; Kmeans doesn't model group variances explicitly (so groups with unequal variances may get blurred)
- Outliers: outliers can pull the cluster mean; K-medoids uses the median instead of the mean

Soft Clustering Methods!
- Fuzzy clustering: use weighted assignments to all clusters; weights depend on relative distance; find the minimum weighted SSE
- Expectation-Maximization: a mixture of multivariate Gaussian distributions in which the mixture weights are missing; find the most likely means & variances for the expectations of the data given the weights

Kmeans unequal cluster variance!

Choosing K, unequal distributions! (figure: total within-cluster SSE for K=1 to 10) Smooth decrease across K => less structure.

EM Clustering! (figure) Selects K=2 (using the Bayesian Information Criterion); handles unequal variance.
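One way to reproduce this in R is the mclust package, which fits Gaussian mixtures by EM and chooses the number of components by BIC; this mirrors the figure's method but is not necessarily the tool used for the slide:

install.packages("mclust")
library(mclust)

fit <- Mclust(X)          # X as in the kmeans sketches above
fit$G                     # number of components chosen by BIC
fit$parameters$mean       # estimated component means
fit$parameters$variance   # estimated (possibly unequal) covariances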

Kmeans Computations!
- Distance of each point to each cluster center: for N points and D dimensions, each loop requires N*D*K operations
- Updating cluster centers: only track points that change assignment and compute the change in the cluster center
- On HPC: distance calculations can be partitioned across data points or dimensions

Dimensionality Reduction via Principal Components! Idea: given N points and P features (aka dimensions), can we represent the data with fewer features? Yes, if features are constant! Yes, if features are redundant! Yes, if features only contribute noise (conversely, we want features that contribute to variations of the data)!

Dimensionality Reduction via Principal Components! PCA: find a set of vectors (aka factors) that describe the data in an alternative way. The first component is the vector that maximizes the variance of the data projected onto that vector. The k-th component is orthogonal to all k-1 previous components.

PCA on 2012 Olympic Athletes! (figure: height-by-weight scatter plot; axes: height in cm and weight in kg, mean centered) Idea: can you rotate the axes so that the data lines up along one axis as much as possible? Or, find the direction in space with the maximum variance of the data?

PCA on 2012 Olympic Athletes! (figure: the same scatter plot with PC1 and PC2 axes overlaid, and the projection of the point (145, 5) onto the PCs) Total variance is conserved: Var(Weight) + Var(Height) = Var(PC1) + Var(PC2). In general: Var(PC1) > Var(PC2) > Var(PC3).
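A sketch of the same rotation idea with base R's prcomp() on synthetic correlated height/weight values (illustrative numbers, not the Olympic data set):

set.seed(4)
weight <- rnorm(200, mean = 75, sd = 12)            # kg
height <- 100 + 0.9 * weight + rnorm(200, sd = 5)   # cm, correlated with weight
hw     <- cbind(weight, height)

pc <- prcomp(hw, center = TRUE)   # mean-center, then rotate onto the PC axes
pc$rotation                       # the PC directions
summary(pc)                       # variance accounted for by PC1 and PC2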

SVD (= PCA for non-square matrices)!
Informally, X (N x P) = U (N x P) * S (P x P) * V' (P x P).
Taking k < P gives X (N x P) ~ U (N x k) * S (k x k) * V' (k x P).
- U corresponds to the observations in k-dimensional factor space (each row of U is a point in k-dimensional space, aka the factor loadings for the data)
- V corresponds to the measurements in k-dimensional factor space (each row of V is a point in k-dim. factor space, aka the factor loadings for the independent variables)
- V' corresponds to the factors in the original space (each row of V' is a point in P-dimensional data space)
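Base R's svd() returns exactly these pieces; a minimal low-rank reconstruction sketch on a toy matrix:

X  <- matrix(rnorm(50 * 10), nrow = 50)   # toy N=50 x P=10 data matrix
s  <- svd(X)                              # X = U S V'
k  <- 3
Xk <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # rank-k approximation of X

s$u[, 1:k]   # observations in k-dimensional factor space
s$v[, 1:k]   # variables' loadings on the k factors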

Principal Components!
- Can choose k heuristically as the approximation improves, or choose k so that 95% of the data variance is accounted for
- Related to the Singular Value Decomposition: PCA applies to square matrices only; SVD is more general
- Works for numeric data only; in contrast, clustering reduces the data to categorical groups
- In some cases, k PCs correspond to roughly k clusters

Summary! Having no label doesn't stop you from finding structure in data! Unsupervised methods are somewhat related!

Earth System Sciences Example!
"An Investigation Using SiroSOM for the Analysis of QUEST Stream-Sediment and Lake-Sediment Geochemical Data" (Fraser & Hodgkinson, 2009), at http://www.geosciencebc.com/s/2009-14.asp (e.g., Quest_Out_Data.csv)
Summary: identify patterns and relationships, flag anomalous data, promote resource assessment and exploration.
Data: 15,000 rows x 42 measures (and metadata).

Data location and spreadsheet!

N=15,000 x P=42 correlation plot!

Standard Deviation of Columns! (figure: standard deviation per column)

Kmeans SSE as K=1..30! (figure) The authors used K=20.

K=20, heat map of centroids! (figure; x-axis: minerals)

Try using matrix factorization (SVD): use 3 factors, then run Kmeans on those! (figure: SSE for K=1..30; the data is now N=15,000 x 3 factors) A sketch of this pipeline follows.
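A hedged sketch of the factor-then-cluster pipeline just described (X stands in for the N=15,000 x 42 QUEST matrix, which is not bundled here):

s  <- svd(scale(X))                          # standardize columns, then factor
U3 <- s$u[, 1:3] %*% diag(s$d[1:3])          # scores on the first 3 factors

km <- kmeans(U3, centers = 20, nstart = 10)  # cluster in the reduced factor space
table(km$cluster)                            # cluster sizes
pairs(U3, col = km$cluster)                  # factor-vs-factor plots, cluster-coded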

Plots of all data points on 2 factors at a time (the U matrix from svd(); you could also plot the V matrix)

PC1xPC2 with Cluster Coded!

PC1xPC3 with Cluster Coded!

Kmeans vs. SiroSOM with Kmeans!
Fraser used SOM (self-organizing map, a non-linear, iterative matrix factorization technique), then Kmeans.
- Kmeans, K=20 on the raw data gives 58% overlap with the article
- 3 SVD factors, then Kmeans K=20 gives 29% overlap
- 10 SVD factors, then Kmeans K=20 gives 38% overlap
(figure: SOM-with-Kmeans cluster vs. my cluster assignment)

Plotting in Physical Space! Fraser (left) cluster assignment vs. Kmeans alone (right)!

End! If time, hierarchical clustering next!

Incremental & Hierarchical Clustering!
- Start with 1 cluster (all instances) and do splits, or start with N clusters (1 per instance) and do merges
- Can be greedy & expensive in its search; some algorithms might both merge & split; algorithms need to store and recalculate distances
- Needs a distance between groups, in contrast to K-means

Incremental & Hierarchical Clustering! The result is a hierarchy of clusters, displayed as a dendrogram tree! Useful for tree-like interpretations: syntax (e.g., word co-occurrences), concepts (e.g., classification of animals), topics (e.g., sorting Enron emails), spatial data (e.g., city distances), genetic expression (e.g., possible biological networks), exploratory analysis!

Incremental & Hierarchical Clustering!
- Clusters are merged/split according to a distance or utility measure: Euclidean distance (squared differences); conditional probabilities (for nominal features)
- Options for choosing which clusters to link: single linkage, mean, average (w.r.t. points in clusters), which may lead to different trees depending on spreads; Ward's method (smallest increase in within-cluster variance); change in probability of features for given clusters

Linkage Options! (figures of two clusters) e.g., single linkage: distance to the closest instance of a cluster; e.g., mean: distance to the mean of all cluster instances.

Linkage Options (cont.)! (figures of two clusters) e.g., average: mean of pairwise distances; e.g., Ward's method: find the new cluster with minimum variance.

Hierarchical Clustering Demo!
3888 interactions among 685 proteins, from the Hu et al. TAP dataset (http://www.compsysbio.org/bacteriome/dataset/):
b0009 b0014 0.92
b0009 b2231 0.87
b0014 b0169 1.0
b0014 b0595 0.76
b0014 b2614 1.0
b0014 b3339 0.95
b0014 b3636 0.9
b0015 b0014 0.99
...

Hierarchical Clustering Demo!
Essential R commands:
> d = read.table("hu_tap_ppi.txt")
> str(d)    # show d's structure
'data.frame': 3888 obs. of 3 variables:
 $ V1: Factor w/ 685 levels "b0009","b0014",..: 1 1 2 2 2 2 2 3 3 3 ...
 $ V2: Factor w/ 536 levels "b0011","b0014",..: 2 248 28 66 297 396 ...
 $ V3: num 0.92 0.87 1 0.76 1 0.95 0.9 0.99 0.99 0.93 ...
> fs = c(d[,1], d[,2])    # combine factor levels
> str(fs)
 int [1:7776] 1 1 2 2 2 2 2 3 3 3 ...
Note: strings are read as factors.

Hierarchical Clustering Demo!
Essential R commands:
p = 685                       # number of proteins (factor levels)
C = matrix(0, p, p)           # connection matrix (aka adjacency matrix)
IJ = cbind(d[,1], d[,2])      # factor levels saved as an Nx2 list of i-th, j-th protein
N = nrow(d)
for (i in 1:N) { C[IJ[i,1], IJ[i,2]] = 1 }   # populate C with 1 for connections
install.packages('igraph')
library('igraph')
gc = graph.adjacency(C, mode="directed")
plot.igraph(gc, vertex.size=3, edge.arrow.size=0, vertex.label=NA)   # or just plot(gc, ...)

Hierarchical Clustering Demo!
hclust with single linkage: chaining!
d2use = dist(C, method="binary")
fit <- hclust(d2use, method="single")
plot(fit)
(figure: dendrogram; the height is the cluster distance when 2 clusters are combined; items that cluster first appear at the bottom)

Hierarchical Clustering Demo! hclust with Ward distance: spherical clusters!

Hierarchical Clustering Demo!
Where the height change looks big, cut off the tree:
groups <- cutree(fit, k=7)
rect.hclust(fit, k=7, border="red")

Hierarchical Clustering Demo!
Kmeans vs. hierarchical: lots of overlap, despite the fact that kmeans does not have a binary distance option.

                    Hierarchical group assignment
Kmeans cluster    1    2    3    4    5    6    7
      1          26    0    0  293    0   19   23
      2           2    1    0    9    0    0   46
      3          72    0    0   27    0    2    1

groups <- cutree(fit, k=7)
Kresult = kmeans(d2use, 7, 10)
table(Kresult$cluster, groups)