Description: 1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

Similar documents
Data Mining Analysis (breast-cancer data)

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Evolutionary Neural Logic Networks in Two Medical Decision Tasks.

Diagnosis of Breast Cancer using Decision Tree Data Mining Technique

A fast multi-class SVM learning method for huge databases

Lecture 2: August 29. Linear Programming (part I)

Data Mining: A Hybrid Approach on the Clinical Diagnosis of Breast Tumor Patients

Introduction to Machine Learning. Dr. Rita Osadchy

New Ensemble Combination Scheme

A SECURE DECISION SUPPORT ESTIMATION USING GAUSSIAN BAYES CLASSIFICATION IN HEALTH CARE SERVICES

Application of Data Mining Techniques to Model Breast Cancer Data

Integration of Data Mining and Hybrid Expert System

Enhancing Data Visualization Techniques

Package acrm. R topics documented: February 19, 2015

Data Mining: A Preprocessing Engine

BREAST CANCER DIAGNOSIS VIA DATA MINING: PERFORMANCE ANALYSIS OF SEVEN DIFFERENT ALGORITHMS

Application of Data mining in Medical Applications

D-optimal plans in observational studies

Zeta: A Global Method for Discretization of Continuous Variables

Diagnosis of Breast Cancer Using Intelligent Techniques

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

1. Classification problems

Classification Rules Mining Model with Genetic Algorithm in Cloud Computing

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Knowledge Discovery from patents using KMX Text Analytics

Diagnostic Challenge. Department of Pathology,

Categorical Data Visualization and Clustering Using Subjective Factors

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Guido Bologna. National University of Singapore. Abstract. The inherent black-box nature of neural networks is an important

A Content based Spam Filtering Using Optical Back Propagation Technique

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Classification of Bad Accounts in Credit Card Industry

Information Model Requirements of Post-Coordinated SNOMED CT Expressions for Structured Pathology Reports

Evaluation of three Simple Imputation Methods for Enhancing Preprocessing of Data with Missing Values

The PageRank Citation Ranking: Bring Order to the Web

Network Intrusion Detection using Semi Supervised Support Vector Machine

A Comparative Study of Simple Online Learning Strategies for Streaming Data

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

A Splitting Criteria Based on Similarity in Decision Tree Learning

Data Mining Techniques for Prognosis in Pancreatic Cancer

A Fast and Efficient Method to Find the Conditional Functional Dependencies in Databases

Carcinosarcoma of the Ovary

Benchmarking Open-Source Tree Learners in R/RWeka

Evaluation of Fourteen Desktop Data Mining Tools

CS 6220: Data Mining Techniques Course Project Description

Decision Trees from large Databases: SLIQ

Raitoharju, Matti; Dashti, Marzieh; Ali-Löytty, Simo; Piché, Robert

KNOWLEDGE BASED ANALYSIS OF VARIOUS STATISTICAL TOOLS IN DETECTING BREAST CANCER

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

On the effect of data set size on bias and variance in classification learning

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, HIKARI Ltd,

Support Vector Machines with Clustering for Training with Very Large Datasets

A Review of Missing Data Treatment Methods

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Healthcare Data Mining: Prediction Inpatient Length of Stay

The Optimality of Naive Bayes

Machine Learning Logistic Regression

Linearly Independent Sets and Linearly Dependent Sets

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification

Least Squares Estimation

Performance Metrics. number of mistakes total number of observations. err = p.1/1

Identification algorithms for hybrid systems

Prototype Internet consultation system for radiologists

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation

DATA ANALYSIS II. Matrix Algorithms

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

A Survey on classification & feature selection technique based ensemble models in health care domain

Predicting Flight Delays

Knowledge-based systems and the need for learning

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

High-Dimensional Data Visualization by PCA and LDA

Assessing Data Mining: The State of the Practice

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Cytopathology Case Presentation #8

Genetic Algorithm Evolution of Cellular Automata Rules for Complex Binary Sequence Prediction

PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION

Rulex s Logic Learning Machines successfully meet biomedical challenges.

Data Mining Yelp Data - Predicting rating stars from review text

Privacy-Preserving Outsourcing Support Vector Machines with Random Transformation

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode

RSVM: Reduced Support Vector Machines

Machine Learning Final Project Spam Filtering

Classification algorithm in Data mining: An Overview

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

An Analysis of Four Missing Data Treatment Methods for Supervised Learning

Decision Tree Learning on Very Large Data Sets

Active Learning SVM for Blogs recommendation

4. GPCRs PREDICTION USING GREY INCIDENCE DEGREE MEASURE AND PRINCIPAL COMPONENT ANALYIS

Section Continued

A Prototype of Cancer/Heart Disease Prediction Model Using Data Mining

The Scientific Data Mining Process

Transcription:

The data set (and description) can be downloaded here:
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

Description:

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
   -- Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992

3. Past Usage:

   Attributes 2 through 10 have been used to represent instances. Each instance has one of 2 possible classes: benign or malignant.

   1. Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. In Proceedings of the National Academy of Sciences, 87, 9193-9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with 50% of the data
         -- Accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with 67% of the data
         -- Accuracy on remaining 33% of dataset: 95.9%

   2. Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms
      -- Collected classification results averaged over 10 trials
      -- Best accuracy result:
         -- 1-nearest neighbor: 93.7%, trained on 200 instances, tested on the other 169
      -- Also of interest:
         -- Using only typical instances: 92.2% (storing only 23.1 instances), trained on 200 instances, tested on the other 169

4. Relevant Information:

   Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

     Group 1: 367 instances (January 1989)
     Group 2:  70 instances (October 1989)
     Group 3:  31 instances (February 1990)
     Group 4:  17 instances (April 1990)
     Group 5:  48 instances (August 1990)
     Group 6:  49 instances (Updated January 1991)
     Group 7:  31 instances (June 1991)
     Group 8:  86 instances (November 1991)
     -----------------------------------------
     Total:   699 points (as of the database donated on 15 July 1992)

   Note that the results summarized above in Past Usage refer to a dataset of size 369, while Group 1 has only 367 instances. This is because Group 1 originally contained 369 instances; 2 were removed. The following statements summarize the changes to the original Group 1 data:

   ##### Group 1 : 367 points: 200B 167M (January 1989)
   ##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   ##### Revised Nov 22, 1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                     : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                     : Changed 0 to 1 in field 6 of sample 1219406
   #####                     : Changed 0 to 1 in field 8 of the following sample:
   #####                     : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

    #   Attribute                      Domain
   ---  -----------------------------  -----------------------------
    1.  Sample code number             id number
    2.  Clump Thickness                1 - 10
    3.  Uniformity of Cell Size        1 - 10
    4.  Uniformity of Cell Shape       1 - 10
    5.  Marginal Adhesion              1 - 10
    6.  Single Epithelial Cell Size    1 - 10
    7.  Bare Nuclei                    1 - 10
    8.  Bland Chromatin                1 - 10
    9.  Normal Nucleoli                1 - 10
   10.  Mitoses                        1 - 10
   11.  Class                          2 for benign, 4 for malignant

8. Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".

9. Class distribution:

   Benign:    458 (65.5%)
   Malignant: 241 (34.5%)
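The data file referenced at the top of this description is a headerless CSV holding the eleven attributes listed above, with "?" marking a missing value. A minimal parsing sketch (stdlib only; the column names and the helper name `parse_wbcd` are this sketch's own choices, and the two sample rows merely illustrate the record format):

```python
import csv
import io

# Short names for the eleven attributes listed above (the file itself
# has no header row; these names are this sketch's own choice).
COLUMNS = [
    "sample_code", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

def parse_wbcd(text):
    """Parse breast-cancer-wisconsin.data records; '?' becomes None."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines
            continue
        rows.append({name: (None if value == "?" else int(value))
                     for name, value in zip(COLUMNS, record)})
    return rows

# Two rows in the file's format; the second has a missing Bare Nuclei value.
sample = "1000025,5,1,1,1,2,1,3,1,1,2\n1057013,8,4,5,1,2,?,7,3,1,4\n"
rows = parse_wbcd(sample)
print(rows[0]["class"])        # 2, i.e. benign
print(rows[1]["bare_nuclei"])  # None, parsed from '?'
```

In practice the same parser can be applied to the downloaded file's contents; only the 16 "?" cells need the None treatment.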

Citation Request:

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. If you publish results when using this database, please include this information in your acknowledgements. Also, please cite one or more of:

1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

2. William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.

3. O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

Please refer to the repository http://archive.ics.uci.edu/ml (see its citation policy). See also Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Descriptive statistics:

Dataset = breast-cancer-wisconsin : n = 699, d = 9
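As a quick sanity check, the headline count n = 699 reconciles with both the group sizes and the class counts given earlier:

```python
# Group sizes from "Relevant Information" (Group 1 after its two removals)
# and class counts from "Class distribution".
groups = [367, 70, 31, 17, 48, 49, 31, 86]
benign, malignant = 458, 241

total = sum(groups)
print(total)                              # 699
print(benign + malignant == total)        # True
print(round(100 * benign / total, 1))     # 65.5
print(round(100 * malignant / total, 1))  # 34.5
```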

Class 1: n = 458 (benign)

Covariance matrix:
[1,]  2.8033  0.4431  0.5052  0.4470  0.2722  0.2176  0.2254  0.3935 -0.0322
[2,]  0.4431  0.8239  0.6432  0.3406  0.3897  0.4413  0.3020  0.5465  0.0187
[3,]  0.5052  0.6432  0.9957  0.3129  0.3602  0.3603  0.2574  0.4771 -0.0019
[4,]  0.4470  0.3406  0.3129  0.9937  0.3456  0.3151  0.2106  0.4081  0.0272
[5,]  0.2722  0.3897  0.3602  0.3456  0.8411  0.2726  0.2002  0.4968 -0.0076
[6,]  0.2176  0.4413  0.3603  0.3151  0.2726  1.5031  0.2471  0.2902  0.0552
[7,]  0.2254  0.3020  0.2574  0.2106  0.2002  0.2471  1.1671  0.4412 -0.0239
[8,]  0.3935  0.5465  0.4771  0.4081  0.4968  0.2902  0.4412  1.1212  0.0253
[9,] -0.0322  0.0187 -0.0019  0.0272 -0.0076  0.0552 -0.0239  0.0253  0.2520

Correlation matrix:
[1,]  1.0000  0.2916  0.3024  0.2678  0.1773  0.1060  0.1246  0.2219 -0.0384
[2,]  0.2916  1.0000  0.7102  0.3765  0.4682  0.3966  0.3080  0.5686  0.0411
[3,]  0.3024  0.7102  1.0000  0.3145  0.3936  0.2945  0.2387  0.4516 -0.0037
[4,]  0.2678  0.3765  0.3145  1.0000  0.3780  0.2579  0.1955  0.3866  0.0543
[5,]  0.1773  0.4682  0.3936  0.3780  1.0000  0.2425  0.2020  0.5116 -0.0166
[6,]  0.1060  0.3966  0.2945  0.2579  0.2425  1.0000  0.1866  0.2236  0.0897
[7,]  0.1246  0.3080  0.2387  0.1955  0.2020  0.1866  1.0000  0.3857 -0.0440
[8,]  0.2219  0.5686  0.4516  0.3866  0.5116  0.2236  0.3857  1.0000  0.0477
[9,] -0.0384  0.0411 -0.0037  0.0543 -0.0166  0.0897 -0.0440  0.0477  1.0000

Median: 3 1 1 1 2 2 2 1 1
Mean:   2.9563 1.3253 1.4432 1.3646 2.1201 2.3712 2.1004 1.2904 1.0633

MCD-estimated means:
MCD-0.975-Mean: 2.7605 1 1.2158 1.2395 1.9974 2.2395 1.9868 1.1053 1.0526
MCD-0.750-Mean: 2.7605 1 1.2158 1.2395 1.9974 2.2395 1.9868 1.1053 1.0526
MCD-0.500-Mean: 2.7605 1 1.2158 1.2395 1.9974 2.2395 1.9868 1.1053 1.0526

Class 2: n = 241 (malignant)

Covariance matrix:
[1,]  5.8993  0.6379  0.6986 -1.1448  0.0790 -0.4450 -0.0959 -0.1024  0.7221
[2,]  0.6379  7.3957  5.0279  2.7892  3.0574 -0.3889  2.3869  2.7621  1.6779
[3,]  0.6986  5.0279  6.5641  2.1794  2.3861 -0.2429  1.9617  2.6854  1.3727
[4,] -1.1448  2.7892  2.1794 10.3071  1.6190 -1.5845  2.4239  1.9378  1.6968
[5,]  0.0790  3.0574  2.3861  1.6190  6.0104 -0.1446  1.1812  1.8577  2.1149
[6,] -0.4450 -0.3889 -0.2429 -1.5845 -0.1446  7.1115 -0.6234  0.3222  0.0748
[7,] -0.0959  2.3869  1.9617  2.4239  1.1812 -0.6234  5.1704  1.9096  0.3373
[8,] -0.1024  2.7621  2.6854  1.9378  1.8577  0.3222  1.9096 11.2270  1.8852
[9,]  0.7221  1.6779  1.3727  1.6968  2.1149  0.0748  0.3373  1.8852  6.5430

Correlation matrix:
[1,]  1.0000  0.0966  0.1123 -0.1468  0.0133 -0.0687 -0.0174 -0.0126  0.1162
[2,]  0.0966  1.0000  0.7216  0.3195  0.4586 -0.0536  0.3860  0.3031  0.2412
[3,]  0.1123  0.7216  1.0000  0.2650  0.3799 -0.0356  0.3367  0.3128  0.2095
[4,] -0.1468  0.3195  0.2650  1.0000  0.2057 -0.1851  0.3320  0.1801  0.2066
[5,]  0.0133  0.4586  0.3799  0.2057  1.0000 -0.0221  0.2119  0.2262  0.3372
[6,] -0.0687 -0.0536 -0.0356 -0.1851 -0.0221  1.0000 -0.1028  0.0361  0.0110
[7,] -0.0174  0.3860  0.3367  0.3320  0.2119 -0.1028  1.0000  0.2506  0.0580
[8,] -0.0126  0.3031  0.3128  0.1801  0.2262  0.0361  0.2506  1.0000  0.2200
[9,]  0.1162  0.2412  0.2095  0.2066  0.3372  0.0110  0.0580  0.2200  1.0000

Median: 7.2026 6.5243 6.5108 5.4833 5.1199 4.5161 6.0056 5.9256 2.3489
Mean:   7.195  6.5726 6.5602 5.5477 5.2988 4.6763 5.9793 5.8631 2.5892

MCD-estimated means:
MCD-0.975-Mean: 7.2171 6.5814 6.6899 6.3333 5.3101 3 6.3023 5.6047 2.5349
MCD-0.750-Mean: 7.2171 6.5814 6.6899 6.3333 5.3101 3 6.3023 5.6047 2.5349
MCD-0.500-Mean: 7.2171 6.5814 6.6899 6.3333 5.3101 3 6.3023 5.6047 2.5349

Measures:
Mah.Dist: 4.2448

Mah.Dist-MCD-0.975: 4.0757
Mah.Dist-MCD-0.750: 4.0757
Mah.Dist-MCD-0.500: 4.0757
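The Mah.Dist figures above are Mahalanobis distances between the two class means, i.e. distances scaled by the inverse of the (classical or MCD-robustified) covariance. A minimal two-attribute sketch with the 2x2 matrix inverse written out by hand (`mahalanobis_2d` is this sketch's own helper; the nine-attribute case is analogous but needs a general matrix inverse):

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance sqrt((x - mu)^T cov^-1 (x - mu)) in 2-D."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With the identity covariance this reduces to Euclidean distance:
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))  # 5.0
```

A larger variance along an axis shrinks the distance contributed by that axis, which is why the MCD-based distances above differ from the classical 4.2448.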