Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center




Outline. Part I - Applications: Motivation and Introduction; Patient similarity application. Part II - Methods: Unsupervised Metric Learning; Supervised Metric Learning; Semi-supervised Metric Learning; Challenges and Opportunities for Metric Learning.

What is a Distance Metric? Following the standard definition (from Wikipedia), a distance metric on a set is a function d(x, y) satisfying four properties for all x, y, z: non-negativity, d(x, y) >= 0; identity of indiscernibles, d(x, y) = 0 if and only if x = y; symmetry, d(x, y) = d(y, x); and the triangle inequality, d(x, z) <= d(x, y) + d(y, z).

Distance Metric Taxonomy. (Diagram:) distance metrics can be organized along several axes: fixed metrics (e.g., Euclidean, Mahalanobis) versus learned metrics, with learning algorithms further divided into unsupervised, supervised, and semi-supervised; distance characteristics, linear versus nonlinear; and the type of data handled, discrete versus continuous and numeric versus categorical.

Categorization of Distance Metrics: Linear vs. Nonlinear. A linear distance first applies a linear mapping to project the data into some space, and then evaluates the pairwise distance as the Euclidean distance in the projected space. The canonical example is the generalized Mahalanobis distance D_M(x, y) = sqrt((x - y)^T M (x - y)): if M is symmetric and positive definite, D_M is a distance metric; if M is symmetric and only positive semi-definite, D_M is a pseudo-metric. A nonlinear distance instead first applies a nonlinear mapping to project the data into some space, and then evaluates the pairwise distance as the Euclidean distance in the projected space.
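
To make the linear case concrete, here is a minimal sketch (NumPy-based; the projection matrix and data are illustrative) showing that a generalized Mahalanobis distance with M = L^T L coincides with the Euclidean distance after the linear projection L:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Generalized Mahalanobis distance D_M(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return np.sqrt(d @ M @ d)

# Any symmetric PSD M factors as M = L^T L, so the Mahalanobis distance
# is exactly the Euclidean distance in the linearly projected space.
rng = np.random.default_rng(0)
L = rng.normal(size=(2, 5))              # hypothetical projection matrix
M = L.T @ L                              # symmetric positive semi-definite
x, y = rng.normal(size=5), rng.normal(size=5)
print(mahalanobis(x, y, M))              # metric form
print(np.linalg.norm(L @ x - L @ y))     # identical, projected form
```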

Categorization of Distance Metrics: Global vs. Local. Global methods learn a distance metric that satisfies global properties of the data set. Examples: PCA finds the directions along which the variation of the whole data set is largest; LDA finds the directions along which data from the same class cluster together while data from different classes are separated. Local methods learn a distance metric that satisfies local properties of the data set. Examples: LLE finds low-dimensional embeddings that preserve the local neighborhood structure; LSML finds the projection under which the local neighbors from different classes are separated.

Related Areas. Metric learning is closely related to two areas: dimensionality reduction and kernel learning.

Related area 1: Dimensionality reduction. Dimensionality reduction is the process of reducing the number of features through feature selection or feature transformation. Feature selection finds a subset of the original features; example criteria include correlation, mRMR, and information gain. Feature transformation maps the original features into a lower-dimensional space; examples include PCA, LDA, LLE, and Laplacian embedding. Every dimensionality reduction technique induces a distance metric: first perform the dimensionality reduction, then evaluate the Euclidean distance in the embedded space.
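
As an illustration of this correspondence, a minimal sketch (assuming NumPy; all names are illustrative) of the distance metric induced by PCA:

```python
import numpy as np

def pca_distance(X, i, j, k=2):
    """Distance metric induced by PCA: project onto the top-k principal
    components, then take the Euclidean distance in the embedded space."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                                # k-dimensional embedding
    return np.linalg.norm(Z[i] - Z[j])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # hypothetical data set
print(pca_distance(X, 0, 1, k=2))
```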

Related area 2: Kernel learning. What is a kernel? Given a data set X with n data points, the kernel matrix K defined on X is an n x n symmetric positive semi-definite matrix whose (i, j)-th element represents the similarity between the i-th and j-th data points. Kernel learning vs. distance metric learning: kernel learning constructs a new kernel from the data, i.e., an inner-product function in some feature space, while distance metric learning constructs a distance function from the data. Kernel learning vs. the kernel trick: kernel learning infers the n x n kernel matrix from the data, whereas the kernel trick applies a predetermined kernel to map the data from the original space to a feature space, so that a nonlinear problem in the original space becomes a linear problem in the feature space. B. Scholkopf, A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
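
For concreteness, a minimal sketch (assuming NumPy; names illustrative) of a predetermined kernel of the kind used with the kernel trick, the RBF kernel, together with a check of the two defining properties of a kernel matrix:

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """n x n RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).

    K is symmetric positive semi-definite; K[i, j] measures the
    similarity between the i-th and j-th data points.
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T     # squared distances
    return np.exp(-gamma * np.clip(d2, 0, None))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
K = rbf_kernel_matrix(X, gamma=0.5)
print(np.allclose(K, K.T))                           # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)          # (numerically) PSD
```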

Different Types of Metric Learning. (Taxonomy diagram:) linear methods divide into unsupervised (PCA, UMMP, LPP), supervised (LDA, MMDA, Eric, RCA, ITML, NCA, ANMM, LMNN), and semi-supervised (CMM, LRML, SSDR); nonlinear methods are obtained by kernelizing the linear ones, plus unsupervised manifold methods such as LE, LLE, ISOMAP, and SNE. In the diagram, red marks local methods and blue marks global methods.

Applications of metric learning: computer vision (image classification and retrieval, object recognition), text analysis (document classification and retrieval), and medical informatics (patient similarity).

Patient Similarity Applications. Q1: How to incorporate physician feedback? Supervised metric learning. Q2: How to integrate patient similarity measures from multiple parties? Composite distance integration. Q3: How to interactively update the existing patient similarity measure? Interactive metric learning.

Q1: How to incorporate physician feedback? The idea is to leverage historical data about similar patients to diagnose the query patient: the system retrieves and ranks similar patients for a query patient and explains why they are similar, while the physician provides positive/negative feedback about the similar-patient results, which is fed into patient similarity learning.

Method: Locally Supervised Metric Learning (LSML). Goal: learn a Mahalanobis distance. (Diagram:) given the patient population X and the physician's judgments, identify each patient's homogeneous neighborhood (nearby patients with the same label) and heterogeneous neighborhood (nearby patients with different labels), compute the average distance within each, and learn the metric that maximizes the difference between the heterogeneous and homogeneous average distances, as sketched below. Fei Wang, Jimeng Sun, Tao Li, Nikos Anerousis: Two Heads Better Than One: Metric+Active Learning and its Applications for IT Service Classification. ICDM 2009: 1022-1027
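
The following is a hedged sketch of the LSML idea, not the authors' exact formulation: build local homogeneous and heterogeneous scatter matrices from each patient's nearest same-label and different-label neighbors, then take the leading eigenvectors of their difference as the learned projection (assuming NumPy; the names and the choice of k are illustrative):

```python
import numpy as np

def lsml_sketch(X, y, k=5, d=2):
    """Illustrative locally supervised metric learning.

    Accumulates homogeneous (same-label) and heterogeneous (different-label)
    local scatter matrices, then returns the Mahalanobis matrix M = W W^T,
    where W holds the top-d eigenvectors of (S_hetero - S_homo).
    """
    n, p = X.shape
    S_homo, S_hetero = np.zeros((p, p)), np.zeros((p, p))
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    for i in range(n):
        order = np.argsort(D[i])
        same = [j for j in order if j != i and y[j] == y[i]][:k]
        diff = [j for j in order if y[j] != y[i]][:k]
        for j in same:
            v = (X[i] - X[j])[:, None]
            S_homo += v @ v.T
        for j in diff:
            v = (X[i] - X[j])[:, None]
            S_hetero += v @ v.T
    # Directions that maximize heterogeneous minus homogeneous scatter.
    vals, vecs = np.linalg.eigh(S_hetero - S_homo)
    W = vecs[:, ::-1][:, :d]                              # top-d eigenvectors
    return W @ W.T

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(2, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
M = lsml_sketch(X, y)
```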

Experiment Evaluation. Data: 1,500 patients downloaded from the MIMIC II database; 590 of them experienced at least one occurrence of an Acute Hypotensive Episode (AHE) and 910 did not. Physiological streams include mean ABP, systolic ABP, diastolic ABP, SpO2, and heart rate measurements, with every sensor sampled at 1-minute intervals. Performance metrics: 1) classification accuracy, 2) retrieval precision@10. Baselines: Challenge09 uses the Euclidean distance on the variance of the mean ABP as suggested in [1]; PCA uses the Euclidean distance over low-dimensional points after principal component analysis (an unsupervised metric learning algorithm); LDA uses the Euclidean distance over low-dimensional points after linear discriminant analysis (a global supervised metric learning algorithm). [1] X. Chen, D. Xu, G. Zhang, and R. Mukkamala. Forecasting acute hypotensive episodes in intensive care patients based on peripheral arterial blood pressure waveform. In Computers in Cardiology (CinC), 2009.

Patient Similarity Applications. Q1: How to incorporate physician feedback? Supervised metric learning. Q2: How to integrate patient similarity measures from multiple parties? Composite distance integration. Q3: How to interactively update the existing patient similarity measure? Interactive metric learning.

Q2: How to integrate patient similarity measures from multiple parties? Different physicians have different similarity measures. (Diagram: one patient population with Similarity I, II, and III from different physicians.) How can these judgments from multiple experts be integrated into a consistent similarity measure, and how can a meaningful distance metric be learned in light of the curse of dimensionality?

Comdi: Composite Distance Integration. (Diagram:) each physician i has raw data X_i, which stays hidden, and shares only a base metric p_i learned from their own neighborhoods; Comdi computes the global metric from the shared base metrics. Knowledge sharing: integrating patient similarity from multiple physicians can combine the experience of different physicians and achieve a better outcome. Efficient collaboration: sharing the models across physicians can be an efficient way to collaborate across multiple hospitals and clinics. Privacy preservation: due to privacy rules regarding PHI data, sharing individual patient data is prohibitively difficult, so model sharing provides an effective alternative. Fei Wang, Jimeng Sun, Shahram Ebadollahi: Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011: 59-70

Composite Distance Integration: Formulation. (Diagram:) each expert defines their own neighborhoods over the patient population; the per-expert objectives are combined into a single objective, which is solved by alternating optimization to output a composite distance metric. Note that the individual distance metrics may be of arbitrary forms and scales.

Composite Distance Integration: Alternating optimization method. (Flowchart:) starting from the optimization problem, alternate between two steps: fix one block of variables and solve for the metric by eigenvalue decomposition, then fix the metric and solve for the other block by Euclidean projection; repeat until convergence and output the metric.
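
A minimal sketch of the alternating scheme (an illustrative objective, not the paper's exact one; assuming NumPy, with a standard projection-onto-the-simplex routine):

```python
import numpy as np

def alternating_optimization(S_list, d=2, iters=20):
    """Illustrative Comdi-style alternating optimization.

    Alternates between (a) fixing the combination weights alpha and solving
    for the projection W by eigendecomposition of the weighted sum of the
    experts' scatter matrices S_i, and (b) fixing W and updating alpha by
    Euclidean projection onto the probability simplex.
    """
    m = len(S_list)
    alpha = np.full(m, 1.0 / m)                      # uniform initial weights
    for _ in range(iters):
        # Step 1: fix alpha, solve for W (top-d eigenvectors).
        S = sum(a * S_i for a, S_i in zip(alpha, S_list))
        _, vecs = np.linalg.eigh(S)
        W = vecs[:, -d:]
        # Step 2: fix W, score each expert, project scores onto the simplex.
        scores = np.array([np.trace(W.T @ S_i @ W) for S_i in S_list])
        alpha = project_simplex(scores)
    return W, alpha

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0)

rng = np.random.default_rng(0)
S_list = []
for _ in range(3):                                   # three hypothetical experts
    A = rng.normal(size=(4, 4))
    S_list.append(A @ A.T)                           # PSD scatter matrix
W, alpha = alternating_optimization(S_list)
```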

Experiment Evaluation. Data source: 135K patients over one year, consisting of diagnosis, procedure, and other information; we construct 247 cohorts, one for each physician. Labels (physician feedback): HCC019, Diabetes with No or Unspecified Complications. Features: all the remaining HCC diagnosis information. Tasks: use 30 randomly selected cohorts to train the models. Performance metric: classification accuracy.

Results on Accuracy. Shared versions learn on all 30 cohorts; secure versions learn on 1 cohort. Our method, Comdi, combines all 30 individual models. Observations: the shared versions perform better than the secure versions, which indicates that sharing is better; Comdi is comparable to LSML, the best among the shared versions, which confirms its effectiveness in combining multiple metrics. (Figure: "Classification Accuracy Comparison" of Comdi against LSML, LSDA, LDA, PCA, and EUC and their M-prefixed counterparts under the shared and secure settings; the accuracy axis runs from 0.5 to 0.8.)

Positive effects on all cohorts. A representative cohort is a good approximation of the entire patient distribution, which often leads to a good base metric; a biased cohort is a bad approximation of the entire patient distribution, which often leads to a bad base metric. The accuracy increases significantly for biased cohorts, and still improves for representative cohorts. (Figure: classification accuracy, roughly 0.55 to 0.75, versus the number of cohorts, 0 to 20, for a representative cohort and a biased cohort.)

Patient Similarity Applications. Q1: How to incorporate physician feedback? Supervised metric learning. Q2: How to integrate patient similarity measures from multiple parties? Composite distance integration. Q3: How to interactively update the existing patient similarity measure? Interactive metric learning.

imet: Incremental Metric Learning Overview. Offline method: model building learns a patient similarity model from historical data, and model scoring uses the learned model to retrieve and score similar patients. Disadvantage: physician feedback cannot be incorporated quickly. Interactive method: model updates are received in real time to adjust the model parameters; the update type considered here is physician feedback modeled as a label change. The questions are: how to adjust the learned patient similarity by incorporating physicians' feedback in real time, and how to adjust it efficiently when patients' features change.

Locally Supervised Metric Learning Revisited. Goal: as before, from the patient population X and the physicians' judgments, identify each patient's homogeneous and heterogeneous neighborhoods, compute the average distances within each, and maximize their difference.

imet: Incremental Metric Learning Approach - Intuition. The optimal W can be obtained by an eigenvalue decomposition of a matrix L constructed from the data. Any feedback can be viewed as an increment to this matrix, so the key question is how to efficiently update the eigensystem of L based on the increment. Fei Wang, Jimeng Sun, Jianying Hu, Shahram Ebadollahi: imet: Interactive Metric Learning in Healthcare Applications. SDM 2011: 944-955

imet: Incremental Metric Learning Approach. Matrix eigensystem perturbation: writing the updated matrix as L + dL, with (lambda_i, v_i) the eigenpairs of L, the standard first-order perturbation solution is lambda_i' ~ lambda_i + v_i^T dL v_i and v_i' ~ v_i + sum over j != i of [v_j^T dL v_i / (lambda_i - lambda_j)] v_j.
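
The standard first-order perturbation formulas can be checked numerically; this sketch assumes distinct eigenvalues and a small symmetric update (NumPy; names illustrative):

```python
import numpy as np

def perturb_eigensystem(vals, vecs, dL):
    """First-order eigensystem perturbation for a symmetric matrix.

    Given eigenpairs (vals, vecs) of L, approximate those of L + dL:
      lambda_i' ~ lambda_i + v_i^T dL v_i
      v_i'      ~ v_i + sum_{j != i} (v_j^T dL v_i) / (lambda_i - lambda_j) v_j
    Assumes the eigenvalues of L are distinct.
    """
    C = vecs.T @ dL @ vecs                       # dL expressed in the eigenbasis
    new_vals = vals + np.diag(C)
    gaps = vals[None, :] - vals[:, None]         # gaps[j, i] = lambda_i - lambda_j
    np.fill_diagonal(gaps, np.inf)               # zero out the j == i terms
    new_vecs = vecs + vecs @ (C / gaps)
    return new_vals, new_vecs

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
L = A + A.T                                      # symmetric base matrix
B = rng.normal(size=(6, 6))
dL = 1e-3 * (B + B.T)                            # small symmetric update
vals, vecs = np.linalg.eigh(L)
approx_vals, _ = perturb_eigensystem(vals, vecs, dL)
exact_vals = np.linalg.eigvalsh(L + dL)
print(np.max(np.abs(approx_vals - exact_vals)))  # tiny first-order error
```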

Experiment evaluation. Initial metric: the patient population was clustered into 10 clusters using k-means on the remaining 194 HCC features, and an initial distance metric was then learned using LSML. Feedback: one of the key HCCs is held out as the simulated feedback; in each round, an index patient was randomly selected and 20 similar patients were selected for feedback. Performance metric: precision@position. (Figure: precision, roughly 0.70 to 0.73, versus the number of feedback rounds, 0 to 25, at positions up to 40.)
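
For reference, a minimal sketch of the precision@k measure used here (names and data illustrative):

```python
import numpy as np

def precision_at_k(retrieved_labels, query_label, k):
    """precision@k: fraction of the top-k retrieved patients whose
    label matches the query patient's label."""
    topk = np.asarray(retrieved_labels)[:k]
    return np.mean(topk == query_label)

# Hypothetical example: labels of patients ranked by learned similarity.
ranked = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(precision_at_k(ranked, 1, k=10))   # 0.7
```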

Part I Summary. Definition of distance metric learning; taxonomy of distance metric learning; related areas: dimensionality reduction and kernel learning; patient similarity application: locally supervised metric learning, multiple metric integration, and interactive metric learning.