Genomic, Proteomic and Transcriptomic Lab, High Performance Computing and Networking Institute, National Research Council, Italy
Mathematical Models of Supervised Learning and their Application to Medical Diagnosis
Mario Rosario Guarracino
January 9, 2007
Acknowledgements: Prof. Franco Giannessi (U. of Pisa), Prof. Panos Pardalos (CAO, UFL), Onur Seref (CAO, UFL), Claudio Cifarelli (HP).
Agenda: Mathematical models of supervised learning; Purpose of incremental learning; Subset selection algorithm; Initial points selection; Accuracy results; Conclusion and future work.
Introduction. Supervised learning refers to the capability of a system to learn from examples (the training set). The trained system is then able to provide an answer (output) for each new question (input). "Supervised" means that the desired output for the training set is provided by an external teacher. Binary classification is among the most successful approaches to supervised learning.
Applications. There are many applications in biology and medicine: tissues that are prone to cancer can be detected with high accuracy; new genes or isoforms of gene expression can be identified in large datasets; new DNA sequences or proteins can be tracked down to their origins; data dimensionality can be analyzed and reduced, and principal characteristics extracted, for drug design.
Problem characteristics. The amount of data produced in biomedical applications will increase exponentially in the coming years. Gene expression data contain tens of thousands of features. In genomic/proteomic applications, data are frequently updated, which poses problems for the training step. Current classification methods can over-fit the problem, producing models that do not generalize well.
Linear discriminant planes. Consider a binary classification task with points in two linearly separable sets A and B. There exists a plane that correctly classifies all points of the two sets; in fact, there are infinitely many planes that correctly classify the training data.
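With the matrix notation used on the later slides (the rows of A and B are the training points of the two classes, e a vector of ones), a separating plane x'w = γ satisfies the standard conditions (not spelled out on the slide):

    \[ A w > e\gamma , \qquad B w < e\gamma , \]

and a new point x is then assigned to class A when x'w > γ.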
SVM classification. A different approach, which singles out one of these planes, is to maximize the margin between support planes. Support planes leave all points of a class on one side; they are pushed apart until they bump into a small set of data points (the support vectors).
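For reference, in its standard linear form (not written out on the slide), with labels y_i ∈ {+1, −1} the maximal-margin plane x'w = γ solves

    \[ \min_{w,\gamma} \tfrac{1}{2}\|w\|^{2} \quad \text{s.t.} \quad y_i (x_i^{\top} w - \gamma) \ge 1 \ \ \forall i , \]

where the constraints are the support-plane conditions and the margin between the two support planes is 2/||w||.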
SVM classification. Support Vector Machines are the state of the art among existing classification methods. Their robustness is due to the strong foundations of statistical learning theory. Training relies on the optimization of a convex quadratic cost function, for which many methods are available; available software includes SVMlight and LIBSVM. These techniques can be extended to nonlinear discrimination by embedding the data in a nonlinear feature space through kernel functions.
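For illustration only (not from the slides): the same kind of kernel SVM can be trained today with scikit-learn, whose SVC class wraps LIBSVM; the data below are synthetic.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    # synthetic two-class data, purely for illustration
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian-kernel SVM (LIBSVM backend)
    clf.fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))
    print("support vectors per class:", clf.n_support_)   # few points define the planes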
A different religion. The binary classification problem can be formulated as a generalized eigenvalue problem (GEPSVM): find the plane x'w_1 = γ_1 that is closest to the points of A and farthest from those of B. O. Mangasarian et al. (2006), IEEE Trans. PAMI.
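In the cited GEPSVM formulation (sketched here in the notation of the previous slides), the plane closest to A and farthest from B minimizes a ratio of distances,

    \[ \min_{(w,\gamma)\neq 0} \frac{\|Aw - e\gamma\|^{2}}{\|Bw - e\gamma\|^{2}} , \]

which, with z = [w; γ], G = [A −e]'[A −e] and H = [B −e]'[B −e], is solved by the eigenvector of G z = λ H z associated with the smallest eigenvalue.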
ReGEC technique. Let [w_1, γ_1] and [w_m, γ_m] be the eigenvectors associated with the minimum and maximum eigenvalues of Gx = λHx. Then each a ∈ A is closer to x'w_1 − γ_1 = 0 than to x'w_m − γ_m = 0, and each b ∈ B is closer to x'w_m − γ_m = 0 than to x'w_1 − γ_1 = 0. M.R. Guarracino et al. (2007), OMS.
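A minimal linear sketch of this eigenvalue-based training (not the authors' code): the Tikhonov term δI used below to regularize G and H is an assumption of the sketch, while the published ReGEC uses its own specific regularization.

    import numpy as np
    from scipy.linalg import eig

    def regec_train(A, B, delta=1e-4):
        """Train two hyperplanes from class matrices A, B (one point per row)."""
        ea, eb = np.ones((A.shape[0], 1)), np.ones((B.shape[0], 1))
        Ga, Gb = np.hstack([A, -ea]), np.hstack([B, -eb])      # [A -e], [B -e]
        I = np.eye(A.shape[1] + 1)
        G, H = Ga.T @ Ga + delta * I, Gb.T @ Gb + delta * I    # assumed regularization
        vals, vecs = eig(G, H)                                 # G z = lambda H z
        vals = vals.real
        z1 = vecs[:, np.argmin(vals)].real                     # plane close to A, far from B
        zm = vecs[:, np.argmax(vals)].real                     # plane close to B, far from A
        return (z1[:-1], z1[-1]), (zm[:-1], zm[-1])            # (w_1, gamma_1), (w_m, gamma_m)

    def regec_predict(X, plane_a, plane_b):
        """Assign each row of X to the class whose plane is closer."""
        dist = lambda w, g: np.abs(X @ w - g) / np.linalg.norm(w)
        return np.where(dist(*plane_a) <= dist(*plane_b), 0, 1)   # 0 -> A, 1 -> B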
Nonlinear classification. When classes cannot be linearly separated, nonlinear discrimination is needed. [Figure: nonlinear classification surfaces on a two-dimensional dataset.] Classification surfaces can become very tangled: such a model accurately describes the original data, but does not generalize to new data (over-fitting).
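These nonlinear surfaces are obtained by replacing each point with its kernel values against the training points; a common choice in this family of methods (the slide does not specify which kernel is used here) is the Gaussian kernel

    \[ K(x_i, x_j) = e^{-\|x_i - x_j\|^{2}/\sigma} . \]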
How to solve the problem?
Incremental classification. A possible solution is to find a small and robust subset of the training set that provides comparable accuracy. A smaller set of points reduces the probability of over-fitting the problem and is computationally more efficient when predicting new points. As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is evaluated only with respect to the small subset.
I-ReGEC: Incremental learning algorithm
1: Γ_0 = C \ C_0
2: {M_0, Acc_0} = Classify(C; C_0)
3: k = 1
4: while |Γ_k| > 0 do
5:   x_k = arg max_{x ∈ M_k ∩ Γ_{k-1}} dist(x, P_class(x))
6:   {M_k, Acc_k} = Classify(C; C_{k-1} ∪ {x_k})
7:   if Acc_k > Acc_{k-1} then
8:     C_k = C_{k-1} ∪ {x_k}
9:     k = k + 1
10:  end if
11:  Γ_k = Γ_{k-1} \ {x_k}
12: end while
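A simplified Python sketch of this loop (not from the slides): Classify is any base classifier, ReGEC in the talk; its interface here, returning predictions, accuracy and the distance of each point from the plane of its own class, is an assumption of the sketch, and candidate selection uses the last accepted model.

    def i_regec(y, C0, classify):
        """Greedy incremental subset selection in the spirit of the algorithm above.

        y        : labels of the full training set C (indexed 0..n-1).
        C0       : indices of the initial subset (e.g. k-means points per class).
        classify : callable(idx) -> (pred, acc, dist); the base classifier is trained
                   on the points with indices idx, pred are predicted labels for all
                   of C, acc is the accuracy on C, and dist[i] is the distance of
                   point i from the plane of its own class (interface assumed here).
        Returns the indices of the selected training subset."""
        C = list(C0)
        gamma = set(range(len(y))) - set(C)                  # Gamma_0 = C \ C_0
        pred, acc, dist = classify(C)
        while gamma:
            wrong = [i for i in gamma if pred[i] != y[i]]    # misclassified, still available
            if not wrong:
                break
            x_k = max(wrong, key=lambda i: dist[i])          # farthest from its class plane
            new_pred, new_acc, new_dist = classify(C + [x_k])
            if new_acc > acc:                                # keep x_k only if accuracy improves
                C.append(x_k)
                pred, acc, dist = new_pred, new_acc, new_dist
            gamma.discard(x_k)                               # Gamma_k = Gamma_{k-1} \ {x_k}
        return C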
I-ReGEC overfitting. [Figures: classification surfaces on a two-dimensional dataset; ReGEC accuracy = 84.44, I-ReGEC accuracy = 85.49.] When the ReGEC algorithm is trained on all points, the surfaces are affected by noisy points (left). I-ReGEC achieves clearly defined boundaries while preserving accuracy (right). Less than 5% of the points are needed for training!
Initial points selection. Unsupervised clustering techniques can be adapted to select the initial points. We compare the classification obtained with k randomly selected starting points for each class against that obtained with k points determined by the k-means method. Results show higher classification accuracy and a more consistent representation of the training set when k-means is used instead of random selection.
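A sketch of this k-means initialization (the function and parameter names are mine, not from the talk):

    import numpy as np
    from sklearn.cluster import KMeans

    def initial_points(X, y, k, random_state=0):
        """Return the indices of k starting points per class: for each class, run
        k-means and take the training point nearest to each centroid, so that the
        initial set C_0 is made of actual training examples."""
        idx = []
        for label in np.unique(y):
            cls = np.where(y == label)[0]
            km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X[cls])
            for c in km.cluster_centers_:
                # nearest training point of this class to the centroid
                idx.append(cls[np.argmin(np.linalg.norm(X[cls] - c, axis=1))])
        return idx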
Initial points selection. Starting points C_i are chosen randomly (top) or by k-means (bottom). For each kernel produced by C_i, a set of evenly distributed points x is classified; the procedure is repeated 100 times. Let y_i ∈ {1, −1} be the classification based on C_i: the average of the y_i estimates the probability that x is assigned to one class. Random: acc = 84, std = 0.05; k-means: acc = 85, std = 0.01.
Initial points selection. Starting points C_i are chosen randomly (top) or by k-means (bottom). For each kernel produced by C_i, a set of evenly distributed points x is classified; the procedure is repeated 100 times. Let y_i ∈ {1, −1} be the classification based on C_i: the average of the y_i estimates the probability that x is assigned to one class. Random: acc = 72.1, std = 1.45; k-means: acc = 97.6, std = 0.04.
Initial point selection. Effect of increasing the number of initial k-means points on the Chessboard dataset. [Figure: classification accuracy versus the total number of initial points 2k from both classes.] This result empirically shows that there is a minimum k for which maximum accuracy is reached.
Initial point selection. The bottom figure shows k versus the number of additional points included in the incremental dataset. [Figures omitted.]
Dataset reduction. Experiments on real and synthetic datasets confirm the training data reduction (chunk: points retained by I-ReGEC; % of train: fraction of the training set):

Dataset       chunk    % of train
Banana        15.70     3.92
German        29.09     4.15
Diabetis      16.63     3.55
Haberman      79        2.76
Bupa          15.28     4.92
Votes         25.90     6.62
WPBC           4.215    4.25
Thyroid       12.40     8.85
Flare-solar    9.67     1.45
Accuracy results. The classification accuracy of the incremental technique compares well with standard methods:

Dataset       ReGEC train   ReGEC acc   I-ReGEC chunk   k    I-ReGEC acc   SVM acc
Banana        400           84.44       15.70           5    85.49         89.15
German        700           70.26       29.09           8    73.00         75.66
Diabetis      468           74.60       16.63           5    74.13         76.21
Haberman      275           73.26       79              2    73.45         71.70
Bupa          310           59.03       15.28           4    63.94         69.90
Votes         391           95.09       25.90           10   93.41         95.60
WPBC           99           58.36       42.15           2    60.27         63.60
Thyroid       140           92.76       12.40           5    94.01         95.20
Flare-solar   666           58.23        9.67           3    65.11         65.80
Positive results. Incremental learning, in conjunction with ReGEC, reduces the size of the training set. Accuracy compares well with that obtained by selecting all training points. The resulting classification surfaces generalize better.
Ongoing research. Microarray technology can scan the expression levels of tens of thousands of genes to classify patients into different groups. For example, it is possible to classify types of cancer with respect to the patterns of gene activity in the tumor cells. Standard methods fail to identify the groups of genes responsible for the classification.
Examples of microarray analysis. Breast cancer: BRCA1 vs. BRCA2 and sporadic mutations, I. Hedenfalk et al., NEJM, 2001. Prostate cancer: prediction of patient outcome after prostatectomy, D. Singh et al., Cancer Cell, 2002. Malignant glioma survival: gene expression vs. histological classification, C. Nutt et al., Cancer Res., 2003. Clinical outcome of breast cancer, L. van 't Veer et al., Nature, 2002. Recurrence of hepatocellular carcinoma after curative resection, N. Iizuka et al., Lancet, 2003. Tumor vs. normal colon tissues, U. Alon et al., PNAS, 1999. Acute myeloid vs. lymphoblastic leukemia, T. Golub et al., Science, 1999.
Feature selection techniques. Standard methods (PCA, SVD, ICA) need long and memory-intensive computations. Statistical techniques (FDA, LDA) are much faster, but can produce low-accuracy results. Hybrid techniques are needed that can take advantage of both approaches.
ILDC-ReGEC. Simultaneous incremental learning and decremental characterization make it possible to acquire knowledge about gene groupings during the classification process. The technique relies on standard statistical indexes (mean µ and standard deviation σ).
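The exact mean/std-based index is not reproduced on the slide; purely as an illustration of this kind of criterion, the sketch below computes the signal-to-noise score of Golub et al., |µ1 − µ2| / (σ1 + σ2), for every gene (the small epsilon is my addition, to avoid division by zero).

    import numpy as np

    def gene_scores(X, y):
        """X: (patients x genes) expression matrix; y: binary class labels (array).
        Returns one score per gene; larger means better class separation.
        Illustrative only: not necessarily the index used by ILDC-ReGEC."""
        X1, X2 = X[y == np.unique(y)[0]], X[y == np.unique(y)[1]]
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        s1, s2 = X1.std(axis=0), X2.std(axis=0)
        return np.abs(mu1 - mu2) / (s1 + s2 + 1e-12)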
ILDC-ReGEC: Golub dataset. About 100 genes out of 7129 are responsible for the discrimination between Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL). The selected genes are in agreement with previous studies. Less than 10 patients, out of 72, are needed for training. Classification accuracy: 96.86%.
ILDC-ReGEC: Golub dataset. [Figure: the misclassified patient.] Different techniques agree on the misclassified patient!
Gene expression analysis. ILDC-ReGEC performs incremental classification with feature selection for microarray datasets. Few experiments and few genes are selected as important for discrimination:

Dataset      Size          chunk    % of train   Features   % of features
H-BRCA1      22 x 3226      6.11    30.5          49.85      1.55
H-BRCA2      22 x 3226      4.28    21.40         56.48      1.75
H-Sporadic   22 x 3226      6.80    34.00         57.15      1.77
Singh        136 x 12600    6.87     5.63        288.23      2.29
Nutt         50 x 12625     8.29    18.42        211.66      1.68
Vantveer     98 x 24188     8.10     9.31        474.35      1.96
Iizuka       60 x 7129     20.14    37.30        122.63      1.72
Alon         62 x 2000      5.43     9.70         32.43      1.62
Golub        72 x 7129      7.25    11.15         95.39      1.34
ILDC-ReGEC: gene expression analysis. Classification accuracy (%) compared with other methods:

Dataset      Size          LLS SVM  KLS SVM  UPCA FDA  SPCA FDA  LUPCA FDA  LSPCA FDA  KUPCA FDA  KSPCA FDA  ILDC-ReGEC
H-BRCA1      22 x 3226     75.00    72.62    77.38     75.00     76.19      69.05      66.67      52.38      80.00
H-BRCA2      22 x 3226     84.2     77.38    72.62     79.76     69.05      72.62      64.29      63.10      85.00
H-Sporadic   22 x 3226     73.81    78.7     69.05     75.00     70.24      79.76      69.05      69.05      77.00
Singh        136 x 12600   91.20    90.48    n.a.      n.a.      88.74      84.85      n.a.       n.a.       77.86
Nutt         50 x 12625    72.22    74.60    n.a.      n.a.      67.46      67.46      n.a.       n.a.       76.60
Vantveer     98 x 24188    66.86    66.86    n.a.      n.a.      65.33      64.7       n.a.       n.a.       68.00
Iizuka       60 x 7129     67.10    61.90    n.a.      n.a.      66.67      61.90      n.a.       n.a.       69.00
Alon         62 x 2000     91.27    82.14    90.08     89.68     90.08      84.2       90.87      81.75      83.00
Golub        72 x 7129     96.83    93.65    93.25     93.25     94.44      90.08      92.06      88.10      96.86
Conclusions. ReGEC is a competitive classification method. Incremental learning reduces redundancy in training sets and can help avoid over-fitting. The subset selection algorithm provides a constructive way to reduce the complexity of kernel-based classification algorithms. The initial points selection strategy can help in finding regions where knowledge is missing. I-ReGEC can be a starting point for exploring very large problems.