Data Mining Analysis (breast-cancer data)




Jung-Ying Wang
Register number: D9115007, May 2003

Abstract

In this AI term project we compare several well-known machine learning tools: the WEKA data mining software (developed at the University of Waikato, Hamilton, New Zealand), MATLAB 6.1, and LIBSVM (developed at National Taiwan University by Chih-Jen Lin).

Contents

1 Breast-cancer-Wisconsin dataset summary
2 Classification - 10-fold cross-validation on the breast-cancer-wisconsin dataset
  2.1 Results for: Naive Bayes
  2.2 Results for: BP Neural Network
  2.3 Results for: J48 decision tree (implementation of C4.5)
  2.4 Results for: SMO (Support Vector Machine)
  2.5 Results for: JRip (implementation of the RIPPER rule learner)
3 Classification - comparison with other papers' results
  3.1.1 Results for training data: Naive Bayes
  3.1.2 Results for test data: Naive Bayes
  3.2.1 Results for training data: BP Neural Network
  3.2.2 Results for test data: BP Neural Network
  3.3.1 Results for training data: J48 decision tree (implementation of C4.5)
  3.3.2 Results for test data: J48 decision tree (implementation of C4.5)
  3.4.1 Results for training data: SMO (Support Vector Machine)
  3.4.2 Results for test data: SMO (Support Vector Machine)
  3.5.1 Results for training data: JRip (implementation of the RIPPER rule learner)
  3.5.2 Results for test data: JRip (implementation of the RIPPER rule learner)
4 Summary
5 References

1. Breast-cancer-Wisconsin dataset summary

In our AI term project, all of the chosen machine learning tools are used to classify the breast-cancer-Wisconsin dataset. To be consistent with the literature [1, 2], we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.

Brief information from the UC Irvine machine learning repository: the data is located in the breast-cancer-wisconsin sub-directory, filenames root: breast-cancer-wisconsin. It currently contains 699 instances, 2 classes (malignant and benign), and 9 integer-valued attributes.

Table 1 shows the attribute information.

 #   Attribute                    Domain
 1.  Sample code number           id number
 2.  Clump Thickness              1-10
 3.  Uniformity of Cell Size      1-10
 4.  Uniformity of Cell Shape     1-10
 5.  Marginal Adhesion            1-10
 6.  Single Epithelial Cell Size  1-10
 7.  Bare Nuclei                  1-10
 8.  Bland Chromatin              1-10
 9.  Normal Nucleoli              1-10
10.  Mitoses                      1-10
11.  Class                        2 for benign, 4 for malignant

Missing attribute values: 16. There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".

Class distribution: Benign: 458 (65.5%); Malignant: 241 (34.5%)
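The removal of the 16 incomplete instances can be sketched as a simple filter over the raw records. The record layout (comma-separated: id, nine attributes, class) and the "?" marker follow the UCI description above; the helper name `remove_missing` and the sample ids are illustrative, not taken from this report.

```python
# Sketch of the preprocessing step described above: drop the instances
# whose Bare Nuclei value is missing (recorded as "?") so that 699 raw
# instances become the 683-instance working dataset.

def remove_missing(rows):
    """Keep only instances with no '?' attribute values."""
    return [r for r in rows if "?" not in r.split(",")]

# Tiny illustration with three raw records (id, 9 attributes, class):
raw = [
    "1000025,5,1,1,1,2,1,3,1,1,2",     # complete -> kept
    "1057013,8,4,5,1,2,?,7,3,1,4",     # missing Bare Nuclei -> dropped
    "1017122,8,10,10,8,7,10,9,7,1,4",  # complete -> kept
]
clean = remove_missing(raw)
print(len(clean))  # 2 of the 3 sample records survive
```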

[Figure 1: bar graph summary of the Clump Thickness attribute in the training data]

[Figure 2: bar graph summary of the Uniformity of Cell Size attribute in the training data]

[Figure 3: bar graph summary of the Uniformity of Cell Shape attribute in the training data]

[Figure 4: bar graph summary of the Marginal Adhesion attribute in the training data]

[Figure 5: bar graph summary of the Single Epithelial Cell Size attribute in the training data]

[Figure 6: bar graph summary of the Bare Nuclei attribute in the training data]

[Figure 7: bar graph summary of the Bland Chromatin attribute in the training data]

[Figure 8: bar graph summary of the Normal Nucleoli attribute in the training data]

[Figure 9: bar graph summary of the Mitoses attribute in the training data]

Table 2 shows the data summary statistics: the frequency of each domain value (1-10) for every attribute.

Domain                        1    2    3    4    5    6    7    8    9   10  Sum
Clump Thickness             139   50  104   79  128   33   23   44   14   69  683
Uniformity of Cell Size     373   45   52   38   30   25   19   28    6   67  683
Uniformity of Cell Shape    346   58   53   43   32   29   30   27    7   58  683
Marginal Adhesion           393   58   58   33   23   21   13   25    4   55  683
Single Epithelial Cell Size  44  376   71   48   39   40   11   21    2   31  683
Bare Nuclei                 402   30   28   19   30    4    8   21    9  132  683
Bland Chromatin             150  160  161   39   34    9   71   28   11   20  683
Normal Nucleoli             432   36   42   18   19   22   16   23   15   60  683
Mitoses                     563   35   33   12    6    3    9    8    0   14  683
Sum                        2843  850  605  333  346  192  207  233   77  516

2 Classification - 10-fold cross-validation on the breast-cancer-wisconsin dataset

First we use the data mining tool WEKA to make predictions on the training data. Here, 10-fold cross-validation on the training data is used to evaluate the performance of each machine learning scheme. The results are as follows:

2.1 Results for: Naive Bayes

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: breast
Instances: 683
Test mode: 10-fold cross-validation

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===

Correctly Classified Instances        659               96.4861 %
Incorrectly Classified Instances       24                3.5139 %
Kappa statistic                         0.9238
K&B Relative Info Score             62650.9331 %
K&B Information Score                 585.4063 bits      0.8571 bits/instance
Class complexity | order 0            637.9242 bits      0.934  bits/instance
Class complexity | scheme            1877.4218 bits      2.7488 bits/instance
Complexity improvement (Sf)         -1239.4976 bits     -1.8148 bits/instance
Mean absolute error                     0.0362
Root mean squared error                 0.1869
Relative absolute error                 7.9508 %
Root relative squared error            39.192  %
Total Number of Instances             683

=== Confusion Matrix ===

   a   b   <-- classified as
 425  19 |   a = 2
   5 234 |   b = 4
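The accuracy and the Kappa statistic that WEKA reports can be reproduced directly from the confusion matrix above. A minimal sketch (variable names are ours; WEKA computes the same quantities internally):

```python
# Recompute accuracy and Kappa from the Naive Bayes confusion matrix:
# rows = actual class (2, 4), columns = predicted class.
cm = [[425, 19],
      [5, 234]]

n = sum(sum(row) for row in cm)               # 683 instances
accuracy = (cm[0][0] + cm[1][1]) / n          # observed agreement p_o

# Expected chance agreement p_e from the row and column marginals.
row = [sum(r) for r in cm]                    # actual class counts
col = [cm[0][j] + cm[1][j] for j in range(2)] # predicted class counts
p_e = sum(row[i] * col[i] for i in range(2)) / n ** 2

kappa = (accuracy - p_e) / (1 - p_e)
print(round(accuracy * 100, 4), round(kappa, 4))  # 96.4861 0.9238
```

Both values match the WEKA summary above, which is a quick sanity check on any transcribed confusion matrix.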

2.2 Results for: BP Neural Network

Scheme: weka.classifiers.functions.neural.NeuralNetwork -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: breast
Instances: 683
Test mode: 10-fold cross-validation

Time taken to build model: 32.06 seconds

=== Stratified cross-validation ===

Correctly Classified Instances        650               95.1684 %
Incorrectly Classified Instances       33                4.8316 %
Kappa statistic                         0.8943
K&B Relative Info Score             60236.9181 %
K&B Information Score                 562.8499 bits      0.8241 bits/instance
Class complexity | order 0            637.9242 bits      0.934  bits/instance
Class complexity | scheme             176.4694 bits      0.2584 bits/instance
Complexity improvement (Sf)           461.4548 bits      0.6756 bits/instance
Mean absolute error                     0.0526
Root mean squared error                 0.203
Relative absolute error                11.5529 %
Root relative squared error            42.5578 %
Total Number of Instances             683

=== Confusion Matrix ===

   a   b   <-- classified as
 425  19 |   a = 2
  14 225 |   b = 4

2.3 Results for: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2
Relation: breast
Instances: 683
Test mode: 10-fold cross-validation

J48 pruned tree
------------------
a2 <= 2
|   a6 <= 3: 2 (395.0/2.0)
|   a6 > 3
|   |   a1 <= 3: 2 (11.0)
|   |   a1 > 3
|   |   |   a7 <= 2
|   |   |   |   a4 <= 3: 4 (2.0)
|   |   |   |   a4 > 3: 2 (2.0)
|   |   |   a7 > 2: 4 (8.0)
a2 > 2
|   a3 <= 2
|   |   a1 <= 5: 2 (19.0/1.0)
|   |   a1 > 5: 4 (4.0)
|   a3 > 2
|   |   a2 <= 4
|   |   |   a6 <= 2
|   |   |   |   a4 <= 3: 2 (11.0/1.0)
|   |   |   |   a4 > 3: 4 (3.0)
|   |   |   a6 > 2: 4 (54.0/7.0)
|   |   a2 > 4: 4 (174.0/3.0)

Number of Leaves : 11
Size of the tree : 21

Time taken to build model: 0.08 seconds

=== Stratified cross-validation ===

Correctly Classified Instances        654               95.754  %
Incorrectly Classified Instances       29                4.246  %
Kappa statistic                         0.9062
K&B Relative Info Score             60273.2845 %
K&B Information Score                 563.1897 bits      0.8246 bits/instance
Class complexity | order 0            637.9242 bits      0.934  bits/instance
Class complexity | scheme            6558.414  bits      9.6024 bits/instance
Complexity improvement (Sf)         -5920.4898 bits     -8.6684 bits/instance
Mean absolute error                     0.0552
Root mean squared error                 0.1962
Relative absolute error                12.123  %
Root relative squared error            41.1396 %
Total Number of Instances             683

=== Confusion Matrix ===

   a   b   <-- classified as
 432  12 |   a = 2
  17 222 |   b = 4

2.4 Results for: SMO (Support Vector Machine)

Scheme: weka.classifiers.functions.supportVector.SMO -C 1.0 -E 1.0 -G 0.01 -A 1000003 -T 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1
Relation: breast
Instances: 683
Test mode: 10-fold cross-validation

SMO

Classifier for classes: 2, 4

BinarySMO

Machine linear: showing attribute weights, not support vectors.

         1.5056 * (normalized) a1
 +       0.2163 * (normalized) a2
 +       1.2795 * (normalized) a3
 +       0.6631 * (normalized) a4
 +       0.901  * (normalized) a5

 +       1.5154 * (normalized) a6
 +       1.2332 * (normalized) a7
 +       0.7335 * (normalized) a8
 +       1.2115 * (normalized) a9
 -       2.598

Number of kernel evaluations: 16169

Time taken to build model: 0.53 seconds

=== Stratified cross-validation ===

Correctly Classified Instances        663               97.0717 %
Incorrectly Classified Instances       20                2.9283 %
Kappa statistic                         0.9359
K&B Relative Info Score             63700.8732 %
K&B Information Score                 595.2169 bits      0.8715 bits/instance
Class complexity | order 0            637.9242 bits      0.934  bits/instance
Class complexity | scheme           21480      bits     31.4495 bits/instance
Complexity improvement (Sf)        -20842.0758 bits    -30.5155 bits/instance
Mean absolute error                     0.0293
Root mean squared error                 0.1711
Relative absolute error                 6.4345 %
Root relative squared error            35.8785 %
Total Number of Instances             683

=== Confusion Matrix ===

   a   b   <-- classified as
 432  12 |   a = 2
   8 231 |   b = 4

2.5 Results for: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: breast
Instances: 683
Test mode: 10-fold cross-validation

JRIP rules:

===========
(a2 >= 4) and (a7 >= 5) => class=4 (148.0/2.0)
(a6 >= 3) and (a1 >= 7) => class=4 (50.0/0.0)
(a3 >= 4) and (a4 >= 4) => class=4 (22.0/2.0)
(a6 >= 4) and (a3 >= 3) => class=4 (19.0/5.0)
(a7 >= 4) and (a1 >= 5) => class=4 (8.0/3.0)
(a8 >= 3) and (a1 >= 6) => class=4 (2.0/0.0)
=> class=2 (434.0/2.0)

Number of Rules : 7

Time taken to build model: 0.19 seconds

=== Stratified cross-validation ===

Correctly Classified Instances        653               95.6076 %
Incorrectly Classified Instances       30                4.3924 %
Kappa statistic                         0.904
K&B Relative Info Score             59761.5689 %
K&B Information Score                 558.4083 bits      0.8176 bits/instance
Class complexity | order 0            637.9242 bits      0.934  bits/instance
Class complexity | scheme            4444.9587 bits      6.508  bits/instance
Complexity improvement (Sf)         -3807.0344 bits     -5.574  bits/instance
Mean absolute error                     0.059
Root mean squared error                 0.2019
Relative absolute error                12.9577 %
Root relative squared error            42.3262 %
Total Number of Instances             683

=== Confusion Matrix ===

   a   b   <-- classified as
 426  18 |   a = 2
  12 227 |   b = 4
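A JRip model is an ordered rule list: the first rule whose conditions match an instance decides its class, and the final rule is the default. The rule set above transcribes directly into code; here `x` maps the attribute codes a1-a9 (attributes 2-10 of Table 1, an assumed but natural correspondence) to their 1-10 values.

```python
# The 10-fold-CV JRip model above, transcribed as an ordered rule list.
# x maps attribute codes a1..a9 to their 1-10 values; the first matching
# rule fires, otherwise the default rule predicts class 2 (benign).
def jrip_classify(x):
    if x["a2"] >= 4 and x["a7"] >= 5: return 4
    if x["a6"] >= 3 and x["a1"] >= 7: return 4
    if x["a3"] >= 4 and x["a4"] >= 4: return 4
    if x["a6"] >= 4 and x["a3"] >= 3: return 4
    if x["a7"] >= 4 and x["a1"] >= 5: return 4
    if x["a8"] >= 3 and x["a1"] >= 6: return 4
    return 2  # default rule

benign = {f"a{i}": 1 for i in range(1, 10)}
malignant = dict(benign, a2=5, a7=6)  # satisfies the first rule
print(jrip_classify(benign), jrip_classify(malignant))  # 2 4
```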

3. Classification - comparison with other papers' results

The machine learning tools above are used in this section to classify the breast-cancer-Wisconsin dataset under the same conditions as the literature. To be consistent with [1, 2] we removed the 16 instances with missing values from the dataset to construct a dataset with 683 instances. The first 400 instances in the dataset are chosen as the training data, and the remaining 283 as the test data.

3.1.1 Results for training data: Naive Bayes

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: breast_training
Instances: 400
Test mode: evaluate on training data

Naive Bayes Classifier

Class 2: Prior probability = 0.57
Class 4: Prior probability = 0.43

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===

Correctly Classified Instances        382               95.5    %
Incorrectly Classified Instances       18                4.5    %
Kappa statistic                         0.9086
K&B Relative Info Score             36321.4474 %
K&B Information Score                 357.7414 bits      0.8944 bits/instance
Class complexity | order 0            393.9122 bits      0.9848 bits/instance
Class complexity | scheme             661.9164 bits      1.6548 bits/instance
Complexity improvement (Sf)          -268.0042 bits     -0.67   bits/instance
Mean absolute error                     0.0445
Root mean squared error                 0.2068
Relative absolute error                 9.0892 %
Root relative squared error            41.794  %
Total Number of Instances             400

=== Confusion Matrix ===

   a   b   <-- classified as
 216  13 |   a = 2
   5 166 |   b = 4

3.1.2 Results for test data: Naive Bayes

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: breast_training
Instances: 400
Test mode: user supplied test set: 283 instances

Naive Bayes Classifier

Class 2: Prior probability = 0.57
Class 4: Prior probability = 0.43

Time taken to build model: 0.02 seconds

=== Evaluation on test set ===

Correctly Classified Instances        277               97.8799 %
Incorrectly Classified Instances        6                2.1201 %
Kappa statistic                         0.9431
K&B Relative Info Score             24653.9715 %
K&B Information Score                 242.8248 bits      0.858  bits/instance
Class complexity | order 0            256.4813 bits      0.9063 bits/instance
Class complexity | scheme             580.1143 bits      2.0499 bits/instance
Complexity improvement (Sf)          -323.6331 bits     -1.1436 bits/instance
Mean absolute error                     0.024
Root mean squared error                 0.1446
Relative absolute error                 5.1968 %
Root relative squared error            30.9793 %
Total Number of Instances             283

=== Confusion Matrix ===

   a   b   <-- classified as
 210   5 |   a = 2
   1  67 |   b = 4

3.2.1 Results for training data: BP Neural Network

Scheme: weka.classifiers.functions.neural.NeuralNetwork -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: breast_training
Instances: 400
Test mode: evaluate on training data

Time taken to build model: 4.74 seconds

=== Evaluation on training set ===

Correctly Classified Instances        395               98.75   %
Incorrectly Classified Instances        5                1.25   %
Kappa statistic                         0.9746
K&B Relative Info Score             38178.7691 %
K&B Information Score                 376.0348 bits      0.9401 bits/instance
Class complexity | order 0            393.9122 bits      0.9848 bits/instance
Class complexity | scheme              29.2867 bits      0.0732 bits/instance
Complexity improvement (Sf)           364.6255 bits      0.9116 bits/instance
Mean absolute error                     0.0253
Root mean squared error                 0.1094
Relative absolute error                 5.1599 %
Root relative squared error            22.1133 %
Total Number of Instances             400

=== Confusion Matrix ===

   a   b   <-- classified as
 224   5 |   a = 2
   0 171 |   b = 4

3.2.2 Results for test data: BP Neural Network

Scheme: weka.classifiers.functions.neural.NeuralNetwork -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a

Relation: breast_training
Instances: 400
Test mode: user supplied test set: 283 instances

Time taken to build model: 5 seconds

=== Evaluation on test set ===

Correctly Classified Instances        278               98.2332 %
Incorrectly Classified Instances        5                1.7668 %
Kappa statistic                         0.9523
K&B Relative Info Score             24660.0925 %
K&B Information Score                 242.8851 bits      0.8583 bits/instance
Class complexity | order 0            256.4813 bits      0.9063 bits/instance
Class complexity | scheme              23.3795 bits      0.0826 bits/instance
Complexity improvement (Sf)           233.1018 bits      0.8237 bits/instance
Mean absolute error                     0.0251
Root mean squared error                 0.1212
Relative absolute error                 5.4184 %
Root relative squared error            25.9732 %
Total Number of Instances             283

=== Confusion Matrix ===

   a   b   <-- classified as
 211   4 |   a = 2
   1  67 |   b = 4

3.3.1 Results for training data: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2
Relation: breast_training
Instances: 400
Test mode: evaluate on training data

J48 pruned tree
------------------

a3 <= 2
|   a7 <= 3: 2 (194.0/1.0)
|   a7 > 3
|   |   a1 <= 4: 2 (7.0)
|   |   a1 > 4: 4 (6.0/1.0)
a3 > 2
|   a6 <= 2
|   |   a5 <= 4: 2 (20.0/1.0)
|   |   a5 > 4: 4 (12.0)
|   a6 > 2: 4 (161.0/9.0)

Number of Leaves : 6
Size of the tree : 11

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===

Correctly Classified Instances        388               97      %
Incorrectly Classified Instances       12                3      %
Kappa statistic                         0.9391
K&B Relative Info Score             35927.8586 %
K&B Information Score                 353.8649 bits      0.8847 bits/instance
Class complexity | order 0            393.9122 bits      0.9848 bits/instance
Class complexity | scheme              68.7303 bits      0.1718 bits/instance
Complexity improvement (Sf)           325.1819 bits      0.813  bits/instance
Mean absolute error                     0.0564
Root mean squared error                 0.1679
Relative absolute error                11.516  %
Root relative squared error            33.937  %
Total Number of Instances             400

=== Confusion Matrix ===

   a   b   <-- classified as
 219  10 |   a = 2
   2 169 |   b = 4
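The pruned tree above reads as a small set of nested conditionals. A direct transcription (attribute codes a1-a9 follow Table 1's attribute order, an assumed correspondence; the leaf comments copy the coverage counts WEKA printed):

```python
# The 6-leaf J48 training tree above, written out as nested conditionals.
def j48_classify(x):
    if x["a3"] <= 2:
        if x["a7"] <= 3:
            return 2                      # leaf (194.0/1.0)
        return 2 if x["a1"] <= 4 else 4   # leaves (7.0) and (6.0/1.0)
    if x["a6"] <= 2:
        return 2 if x["a5"] <= 4 else 4   # leaves (20.0/1.0) and (12.0)
    return 4                              # leaf (161.0/9.0)

print(j48_classify({f"a{i}": 1 for i in range(1, 10)}))   # 2 (benign)
print(j48_classify({f"a{i}": 10 for i in range(1, 10)}))  # 4 (malignant)
```

Note how few attributes the tree actually consults (a3, a7, a1, a6, a5), which is consistent with the feature-selection results cited in the literature for this dataset.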

3.3.2 Results for test data: J48 decision tree (implementation of C4.5)

Scheme: weka.classifiers.trees.j48.J48 -C 0.25 -M 2
Relation: breast_training
Instances: 400
Test mode: user supplied test set: 283 instances

J48 pruned tree
------------------
a3 <= 2
|   a7 <= 3: 2 (194.0/1.0)
|   a7 > 3
|   |   a1 <= 4: 2 (7.0)
|   |   a1 > 4: 4 (6.0/1.0)
a3 > 2
|   a6 <= 2
|   |   a5 <= 4: 2 (20.0/1.0)
|   |   a5 > 4: 4 (12.0)
|   a6 > 2: 4 (161.0/9.0)

Number of Leaves : 6
Size of the tree : 11

Time taken to build model: 0.02 seconds

=== Evaluation on test set ===

Correctly Classified Instances        275               97.1731 %
Incorrectly Classified Instances        8                2.8269 %
Kappa statistic                         0.9218
K&B Relative Info Score             23633.2044 %
K&B Information Score                 232.7709 bits      0.8225 bits/instance
Class complexity | order 0            256.4813 bits      0.9063 bits/instance
Class complexity | scheme            1115.1466 bits      3.9404 bits/instance
Complexity improvement (Sf)          -858.6653 bits     -3.0342 bits/instance
Mean absolute error                     0.0461
Root mean squared error                 0.1646

Relative absolute error                 9.9595 %
Root relative squared error            35.2651 %
Total Number of Instances             283

=== Confusion Matrix ===

   a   b   <-- classified as
 212   3 |   a = 2
   5  63 |   b = 4

3.4.1 Results for training data: SMO (Support Vector Machine)

Scheme: weka.classifiers.functions.supportVector.SMO -C 1.0 -E 1.0 -G 0.01 -A 1000003 -T 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1
Relation: breast_training
Instances: 400
Test mode: evaluate on training data

SMO

Classifier for classes: 2, 4

BinarySMO

Machine linear: showing attribute weights, not support vectors.

         1.4364 * (normalized) a1
 +       0.4204 * (normalized) a2
 +       1.0846 * (normalized) a3
 +       1.0712 * (normalized) a4
 +       0.9297 * (normalized) a5
 +       1.409  * (normalized) a6
 +       1.0571 * (normalized) a7
 +       0.6458 * (normalized) a8
 +       1.1078 * (normalized) a9
 -       2.3339

Number of kernel evaluations: 7446

Time taken to build model: 0.66 seconds
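Because the kernel is linear, the BinarySMO model above is just a weighted sum of normalized attributes: predict class 4 (malignant) when w · x_norm − b > 0. A sketch using the printed weights; the min-max normalization of the 1-10 domain to [0, 1] is our assumption about what "(normalized)" means here, so boundary cases may differ slightly from WEKA's output.

```python
# Linear SMO decision function with the weights printed in the model above.
W = [1.4364, 0.4204, 1.0846, 1.0712, 0.9297,
     1.4090, 1.0571, 0.6458, 1.1078]
B = 2.3339

def smo_classify(attrs):
    """attrs: the nine 1-10 attribute values a1..a9."""
    xs = [(a - 1) / 9 for a in attrs]               # assumed min-max scaling
    score = sum(w * x for w, x in zip(W, xs)) - B   # w . x_norm - b
    return 4 if score > 0 else 2

print(smo_classify([1] * 9), smo_classify([10] * 9))  # 2 4
```

All weights are positive, so every attribute pushes toward malignancy as its value grows, matching the dataset's design where higher scores indicate more abnormal cells.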

=== Evaluation on training set ===

Correctly Classified Instances        387               96.75   %
Incorrectly Classified Instances       13                3.25   %
Kappa statistic                         0.9338
K&B Relative Info Score             37314.0252 %
K&B Information Score                 367.5177 bits      0.9188 bits/instance
Class complexity | order 0            393.9122 bits      0.9848 bits/instance
Class complexity | scheme           13962      bits     34.905  bits/instance
Complexity improvement (Sf)        -13568.0878 bits    -33.9202 bits/instance
Mean absolute error                     0.0325
Root mean squared error                 0.1803
Relative absolute error                 6.6389 %
Root relative squared error            36.4406 %
Total Number of Instances             400

=== Confusion Matrix ===

   a   b   <-- classified as
 220   9 |   a = 2
   4 167 |   b = 4

3.4.2 Results for test data: SMO (Support Vector Machine)

Scheme: weka.classifiers.functions.supportVector.SMO -C 1.0 -E 1.0 -G 0.01 -A 1000003 -T 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1
Relation: breast_training
Instances: 400
Test mode: user supplied test set: 283 instances

SMO

Classifier for classes: 2, 4

BinarySMO

Machine linear: showing attribute weights, not support vectors.

         1.4364 * (normalized) a1

 +       0.4204 * (normalized) a2
 +       1.0846 * (normalized) a3
 +       1.0712 * (normalized) a4
 +       0.9297 * (normalized) a5
 +       1.409  * (normalized) a6
 +       1.0571 * (normalized) a7
 +       0.6458 * (normalized) a8
 +       1.1078 * (normalized) a9
 -       2.3339

Number of kernel evaluations: 7446

Time taken to build model: 0.33 seconds

=== Evaluation on test set ===

Correctly Classified Instances        279               98.5866 %
Incorrectly Classified Instances        4                1.4134 %
Kappa statistic                         0.9617
K&B Relative Info Score             25215.9493 %
K&B Information Score                 248.3599 bits      0.8776 bits/instance
Class complexity | order 0            256.4813 bits      0.9063 bits/instance
Class complexity | scheme            4296      bits     15.1802 bits/instance
Complexity improvement (Sf)         -4039.5187 bits    -14.2739 bits/instance
Mean absolute error                     0.0141
Root mean squared error                 0.1189
Relative absolute error                 3.0559 %
Root relative squared error            25.4786 %
Total Number of Instances             283

=== Confusion Matrix ===

   a   b   <-- classified as
 212   3 |   a = 2
   1  67 |   b = 4

3.5.1 Results for training data: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: breast_training
Instances: 400

Test mode: evaluate on training data

JRIP rules:
===========
(a2 >= 3) and (a2 >= 5) => class=4 (116.0/2.0)
(a6 >= 3) and (a3 >= 3) => class=4 (55.0/7.0)
(a1 >= 6) and (a8 >= 4) => class=4 (5.0/0.0)
(a2 >= 4) => class=4 (4.0/1.0)
=> class=2 (220.0/1.0)

Number of Rules : 5

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===

Correctly Classified Instances        389               97.25   %
Incorrectly Classified Instances       11                2.75   %
Kappa statistic                         0.9442
K&B Relative Info Score             36393.688  %
K&B Information Score                 358.453  bits      0.8961 bits/instance
Class complexity | order 0            393.9122 bits      0.9848 bits/instance
Class complexity | scheme              57.2873 bits      0.1432 bits/instance
Complexity improvement (Sf)           336.6249 bits      0.8416 bits/instance
Mean absolute error                     0.0491
Root mean squared error                 0.1567
Relative absolute error                10.0299 %
Root relative squared error            31.6717 %
Total Number of Instances             400

=== Confusion Matrix ===

   a   b   <-- classified as
 219  10 |   a = 2
   1 170 |   b = 4

3.5.2 Results for test data: JRip (implementation of the RIPPER rule learner)

Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: breast_training
Instances: 400

Test mode: user supplied test set: 283 instances

JRIP rules:
===========
(a2 >= 3) and (a2 >= 5) => class=4 (116.0/2.0)
(a6 >= 3) and (a3 >= 3) => class=4 (55.0/7.0)
(a1 >= 6) and (a8 >= 4) => class=4 (5.0/0.0)
(a2 >= 4) => class=4 (4.0/1.0)
=> class=2 (220.0/1.0)

Number of Rules : 5

Time taken to build model: 0.08 seconds

=== Evaluation on test set ===

Correctly Classified Instances        276               97.5265 %
Incorrectly Classified Instances        7                2.4735 %
Kappa statistic                         0.9326
K&B Relative Info Score             24255.9546 %
K&B Information Score                 238.9046 bits      0.8442 bits/instance
Class complexity | order 0            256.4813 bits      0.9063 bits/instance
Class complexity | scheme              40.6116 bits      0.1435 bits/instance
Complexity improvement (Sf)           215.8697 bits      0.7628 bits/instance
Mean absolute error                     0.0329
Root mean squared error                 0.1457
Relative absolute error                 7.116  %
Root relative squared error            31.2218 %
Total Number of Instances             283

=== Confusion Matrix ===

   a   b   <-- classified as
 211   4 |   a = 2
   3  65 |   b = 4

4. Summary

This section presents summary tables for scheme accuracy and running times.

Table 4.1: Accuracy and running time for 10-fold cross-validation

Model                          Running time     10-fold cross val.
Naive Bayes                    0.08 seconds     96.4861%
BP neural network              32.06 seconds    95.1684%
J48 decision tree (C4.5)       0.08 seconds     95.7540%
SMO (support vector machine)   0.53 seconds     97.0717%
JRip (RIPPER rule learner)     0.19 seconds     95.6076%

Table 4.2: Accuracy on training and test data for the different models

Model                          Training data    Test data
Naive Bayes                    95.50%           97.8799%
BP neural network              98.75%           98.2332%
J48 decision tree (C4.5)       97.00%           97.1731%
SMO (support vector machine)   96.75%           98.5866%
JRip (RIPPER rule learner)     97.25%           97.5265%

Table 4.3: A comparison with other papers

Method                                         Test accuracy
WEKA 3.3.6
  Naive Bayes                                  97.8799%
  BP neural network                            98.2332%
  J48 decision tree (C4.5)                     97.1731%
  SMO (support vector machine)                 98.5866%
  JRip (RIPPER rule learner)                   97.5265%
Fogel et al. [1]                               98.1000%
Abbass et al. [2]                              97.5000%
Abbass H. A. [3]                               98.1000%

5. References

1. Fogel DB, Wasson EC, Boughton EM. Evolving neural networks for detecting breast cancer. Cancer Lett. 1995; 96(1): 49-53.
2. Abbass HA, Towsey M, Finn GD. C-Net: a method for generating non-deterministic and dynamic multivariate decision trees. Knowledge Inf. Syst. 2001; 3: 184-97.
3. Abbass HA. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine 2002; 25: 265-281.