CS 525 Class Project
Breast Cancer Diagnosis via Quadratic Programming
Fall 2015
Due 15 December 2015, 5:00pm

In this project, we apply quadratic programming to breast cancer diagnosis. We use the Wisconsin Diagnostic Breast Cancer (WDBC) database, made publicly available by Dr. William H. Wolberg (Department of Surgery, UW Medical School), Professor W. Nick Street (Management Sciences Department, University of Iowa), and Prof. O. L. Mangasarian (Computer Sciences Department, UW). The database is available through the class web site, as the files wdbc.data and wdbc.names. You should read the file wdbc.names, which gives some background information and describes the structure of the data file.

The idea of the project is to come up with a discriminant function (a separating plane, in this case) to determine whether an unknown tumor sample is benign or malignant. To do so, you will use part of the data in the above database as a training set to generate the separating plane, and the remaining part as a testing set to test the effectiveness of the separating plane.

Attributes 3 to 32 of each piece of data form a 30-dimensional vector, that is, a point in 30-dimensional real space R^30. A training set, consisting of two disjoint point sets in R^30 representing confirmed benign and malignant fine needle aspirates (FNAs), is used to generate the separating plane. (Figure 1 shows the different appearance of malignant and benign samples.)

Figure 1: Nuclei of cells of malignant (left) and benign (right) fine needle aspirates taken from patients' breasts.

Each sample point is labeled either B or M, to indicate whether it is benign or malignant. We can therefore form two sets of points in the space R^30: a set B of points corresponding to the benign tumors, and a set M of points corresponding to the malignant tumors. Our aim is to compute a plane in R^30 which separates these two sets "as much as possible". When a new sample point is obtained, we diagnose it as benign or malignant depending on whether it lies on the B side of the separating plane or on the M side.

We say "as much as possible" because it may be that the data in B and M is interspersed in a way that makes it impossible to find a plane that performs the separation cleanly. In this case, we seek a plane that minimizes some measure of the classification error. The classification error for a point is strictly positive if it lies on the wrong side of the plane (for example, the point represents a benign sample but lies on the M side of the plane), and zero otherwise.

Algebraically, the separating plane is defined by a linear function f with the following desired property:

    f(x) > 0  ⇒  x ∈ M,        f(x) ≤ 0  ⇒  x ∈ B.

This function is given by f(x) = w^T x - γ, where w ∈ R^30 and γ ∈ R are to be determined from the training data. The quantities w and γ are determined

by solving a quadratic program in MATLAB. The form of this quadratic program is similar to one proposed in [1, 4]; see also [3]. If we represent the set of m points in M by a matrix M ∈ R^{m×n} and the set of k points in B by a matrix B ∈ R^{k×n}, then the problem becomes one of choosing w and γ to solve the following minimization problem:

    min_{w,γ}  (1/m) ||(-Mw + e_m γ + e_m)_+||_1  +  (1/k) ||(Bw - e_k γ + e_k)_+||_1.

Here e_m and e_k are vectors of lengths m and k, respectively, whose entries are all 1, while ((z)_+)_i = max{z_i, 0} for i = 1, 2, ..., m, and ||z||_1 = Σ_{i=1}^m |z_i| for z ∈ R^m. This problem approximately minimizes the number of points that are misclassified, by choosing w and γ to minimize the sum of the distances (times ||w||_2) to the separating plane whenever a point is on the incorrect side of the plane.

The (·)_+ and ||·||_1 functions can be eliminated to yield the following linear programming reformulation:

    min_{w,γ,y,z}  (1/m) e^T y + (1/k) e^T z
    subject to     Mw - eγ + y ≥ e,
                  -Bw + eγ + z ≥ e,
                   y ≥ 0,  z ≥ 0.

If the data sets M and B are separable, then this linear program may have multiple solutions. In order to choose an appropriate solution from these, w is typically chosen to maximize the separation margin between the two data sets. It can be shown that the separation margin is given by the reciprocal of ||w||_2, so we modify the above formulation by adding a multiple of the squared two-norm of w to the objective:

    min_{w,γ,y,z}  (1/m) e^T y + (1/k) e^T z + (µ/2) w^T w
    subject to     Mw - eγ + y ≥ e,
                  -Bw + eγ + z ≥ e,
                   y ≥ 0,  z ≥ 0.

Here µ is a penalty parameter. As its value is increased, the norm of the solution w tends to decrease. This quadratic programming formulation is the one you will work with in this project.

You can obtain a MATLAB file wdbcdata.m that contains a function that reads the data from wdbc.data and stores it in test and training matrices. See the comments at the start of this file for details.
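As a point of reference, the following is a minimal sketch of how this quadratic program might be expressed in CVX. The variable names (Mmat and Bmat for the matrices of malignant and benign training points, and gamma0 for γ, since gamma is a built-in MATLAB function) are assumptions for illustration only; they are not defined by wdbcdata.m, and the sketch is not a substitute for the code you are asked to write in Part 1.

    % Minimal CVX sketch of the quadratic program above (illustrative only).
    % Assumes Mmat (m x n) holds the malignant training points, Bmat (k x n)
    % the benign ones, and mu is the penalty parameter.
    [m, n] = size(Mmat);
    k = size(Bmat, 1);
    mu = 0.001;
    cvx_begin
        variables w(n) gamma0 y(m) z(k)
        minimize( (1/m)*sum(y) + (1/k)*sum(z) + (mu/2)*sum_square(w) )
        subject to
            Mmat*w - gamma0*ones(m,1) + y >= ones(m,1);
            -Bmat*w + gamma0*ones(k,1) + z >= ones(k,1);
            y >= 0;
            z >= 0;
    cvx_end

Writing the quadratic term as sum_square(w) keeps the objective in a form CVX recognizes as convex; after cvx_end, the variables w and gamma0 hold the plane parameters and cvx_optval holds the optimal objective value.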

The project consists of the following four parts.

1. Write code in MATLAB and CVX to solve the QP formulation above.

   (a) Test your routine on the training data obtained by setting fractest=0.1 and reord=0 in wdbcdata, using the value µ = 0.001. (This will yield a training data set with 512 data points, consisting of records 1 through 512 from the file wdbc.data.)

   (b) Test your routine on the training data obtained by setting fractest=0.15 and reord=1 in wdbcdata, using µ = 0.001. (This will yield a training matrix of 484 records randomly selected from the 569 samples in the file wdbc.data.)

   Make sure you print out w, γ, and the optimal objective of the quadratic program. Also print out the number of misclassified points in the training set, that is, the number of points in the training set that lie on the wrong side of the calculated plane.

2. By modifying your code for part 1, write a program to obtain the separating plane on the training set, and then determine the number of misclassified points on the corresponding testing set, for the following cases (use µ = 0.001 for each):

   (a) fractest=0.1 and reord=0
   (b) fractest=0.15 and reord=0
   (c) fractest=0.20 and reord=1
   (d) fractest=0.05 and reord=1

   Print fractest, reord, and the number of misclassified points in the testing set for each case. (Do not print w, γ, or the optimal objective for these cases.)

3. Suppose that the oncologist wants to use only 2 of the 30 attributes to make the diagnosis. Determine which pair of attributes is most effective in obtaining a correct diagnosis, as follows. First, obtain a training set by setting fractest=0.12 and reord=0. Considering each of the (30 choose 2) = 435 pairs of possible attributes, use the training set and the formulation above to determine a separating plane in R^2. Use µ = 0.0008 throughout. For each plane, use the training set with the corresponding pair of attributes to determine the number of misclassified cases.

   Each time you find an attribute pair that gives the fewest training-set misclassifications encountered so far, print out a line of the form

       fprintf('attributes %2d %2d: misclass %3d\n', i, j, wrong);

   (Note: do not print out this line for every one of the 435 pairs! Only do it when you find a pair that gives the best results so far.)

4. Apply the best-performing attribute pair from Part 3 above to the testing set. First, determine the number of incorrectly classified points in the testing set. Then, plot all the testing-set points on a two-dimensional figure using MATLAB's plotting routines. Use 'o' for benign points and '+' for malignant points in the plot. Use MATLAB commands to draw the calculated separating plane w^T x = γ on the plot. Check to see whether the number of misclassified points agrees with the plot, and comment. (Note that some points may coalesce, so you may want to randomly perturb points by a small amount to visualize all the points.) A sketch of how this step might be coded appears below, after the hand-in instructions.

Hand in listings of your code and output, and a short (approximately one page) summary and discussion of your results.
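For Part 4, the following sketch shows one way to count the testing-set misclassifications and draw the requested plot. The names used here are assumptions: Mtest and Btest are taken to hold the malignant and benign testing points restricted to the chosen pair of attributes, and w (a 2-vector) and gamma0 are the plane parameters computed from the training set. None of these names come from wdbcdata.m, so adapt them to your own code.

    % Count testing-set misclassifications under the rule f(x) = w'*x - gamma0:
    % a malignant point is misclassified if f(x) <= 0, a benign one if f(x) > 0.
    wrong = sum(Mtest*w - gamma0 <= 0) + sum(Btest*w - gamma0 > 0);
    fprintf('misclassified testing points: %d\n', wrong);

    % Plot the testing points and the line w(1)*x1 + w(2)*x2 = gamma0.
    figure; hold on;
    plot(Btest(:,1), Btest(:,2), 'o');     % benign points
    plot(Mtest(:,1), Mtest(:,2), '+');     % malignant points
    x1 = linspace(min([Btest(:,1); Mtest(:,1)]), max([Btest(:,1); Mtest(:,1)]), 100);
    if abs(w(2)) > 1e-12
        plot(x1, (gamma0 - w(1)*x1)/w(2), '-');   % solve the plane equation for x2
    else
        x2lim = [min([Btest(:,2); Mtest(:,2)]), max([Btest(:,2); Mtest(:,2)])];
        plot((gamma0/w(1))*[1 1], x2lim, '-');    % vertical line when w(2) is ~0
    end
    hold off;

The if/else simply guards against the degenerate case in which the separating plane is (nearly) vertical in the two chosen coordinates.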

References

[1] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.

[2] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), pages 82–90, San Francisco, California, 1998. Morgan Kaufmann. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps.

[3] M. C. Ferris and O. L. Mangasarian. Breast cancer diagnosis via linear programming. IEEE Computational Science and Engineering, 2:70–71, 1995.

[4] O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43:570–577, 1995.