
APPLICATION OF POPULATION-BASED TECHNOLOGY IN SELECTION OF GLYCAN MARKERS FOR CANCER DETECTION

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Haofei Fang

Summer 2012


Copyright 2012 by Haofei Fang. All Rights Reserved.

DEDICATION

This thesis is dedicated to my dear fiancée, who supported me when I was perplexed during the thesis preparation. She is one of the great forces pushing me forward. It is also dedicated to my parents, who taught me how to face difficulties. Their support encourages me to find my way to success.

ABSTRACT OF THE THESIS

Application of Population-Based Technology in Selection of Glycan Markers for Cancer Detection
by Haofei Fang
Master of Science in Computer Science
San Diego State University, 2012

Recent advances in computer technology and in molecular biology have greatly influenced and promoted the field of bioinformatics. These advances include new high-throughput platforms for biomarker discovery and new algorithms for feature selection and classification. This thesis is dedicated to a class of feature selection and classification algorithms based on a new paradigm of artificial intelligence and pattern recognition known as swarm intelligence. The particular algorithm considered is Ant Colony Optimization (ACO), which is applied to a recently emerged biomarker platform based on printed glycan arrays (PGA). The thesis proposes an implementation of ACO that is specially tuned for the diagnosis of cancer using PGA data. The implementation is evaluated on real clinical data obtained from the School of Medicine of NYU, which contain 65 control samples of high-risk subjects exposed to asbestos and 5 subjects diagnosed with malignant mesothelioma. The results are compared with those obtained on artificially generated data whose general characteristics are similar to those of the real data.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY
    Mesothelioma Study, Demographics and Goals
    Printed Glycan Array
        General Information
        Structure of Data for MATLAB
    Data Preprocessing
        Normalization
            Quantile Normalization
            Intra-Slide Normalization
            Inter-Slide Normalization
        Transformation
            Transformation Necessity
            Implementation and Result

CHAPTER 3 FEATURE SELECTION AND CLASSIFICATION
    Univariate Feature Selection
    Multivariate Feature Selection
        Forward Sequential Feature Selection (FWD)
        Recursive Feature Elimination (RFE)
        Genetic Algorithm (GA)
        Ant Colony Optimization (ACO)
        Classification and Regression Trees (C&RT/C4.5)
        Random Forest Trees (RF)
    Classifiers
        Multiple Logistic Regression (MLR)
        Generalized Linear Model (GLM)
        Linear Discriminant Analysis (LDA)
        Support Vector Machines (SVM)
        Naive Bayes/Mahalanobis Distance
        K-Nearest Neighbor (KNN)
    Classifier Performance Measures
        Accuracy
        Area Under the ROC Curve (AUC)
    Cross Validation
        Leave-One-Out Cross Validation (LOOCV)
        K-fold Cross Validation
        Hold-Out Cross Validation

CHAPTER 4 ANT COLONY OPTIMIZATION ALGORITHM
    Theory and the Algorithm
    Implementation
        Optimization Objective
        M-Files
        Step 1: Initialization
        Step 2: Population
        Step 3: Evaluation
        Step 4: Deposition
        Step 5: Preparation for New Iteration
        Optional Step: Randomization
    Empirical Tuning of ACO Parameters
        Number of Ants
        Stopping Criteria
    Application of ACO to Artificial Data
        Generation of Artificial Data
        Contaminated Artificial Data
        4.4.3 Results
    Application of ACO to Mesothelioma Study
        Experiment Design
        Results

CHAPTER 5 COMPARISON OF ACO WITH OTHER APPROACHES IN CLASSIFICATION
    Genetic AUC Optimizer (GAUC)
    Experiment Design
        Efficiency Test
        Stability Test
        Cross-Validation Performance on Raw and Contaminated Mesothelioma Data
    Results
        Efficiency Test
        Stability Test
        Cross-Validation Performance on Raw and Contaminated Mesothelioma Data

CHAPTER 6 CONCLUSION

REFERENCES

LIST OF TABLES

Table 3.1. WMW Rank Calculation Demonstration
Table 3.2. Result of Applying WMW to Mesothelioma Data Set
Table 4.1. Parameters to be Initialized for ACO
Table 4.2. Summary of Number of Ants Tuning (m = 4)
Table 4.3. Summary of Number of Ants Tuning Verification (m = 6)
Table 4.4. OCI-GID Reference Table
Table 4.5. Contamination Parameters for Artificial Data
Table 4.6. Selected Values of Parameters for Contamination
Table 4.7. The Performance of WMW on Artificial Data without/with Contamination
Table 4.8. Performance of WMW and ACO on Artificial Data
Table 4.9. Performance of WMW and ACO Applied on Artificial Data - Bootstrap
Table 4.10. Result of Comparing ACO Repeats and WMW on Mesothelioma Data Set
Table 4.11. Result of Comparing ACO Repeats and WMW on Subsampled Mesothelioma Data Set
Table 5.1. Execution Time for Each Combination
Table 5.2. Average AUC Values for Each Combination
Table 5.3. Results of Stability Experiment
Table 5.4. Cross Validation on Mesothelioma Data - Best AUC Values
Table 5.5. Cross Validation on Mesothelioma Data - Features at Best AUC Values
Table 5.6. Cross Validation on Mesothelioma Data - Best Stability
Table 5.7. Cross Validation on Mesothelioma Data - Features at Best Stability
Table 5.8. Compare ACO and WMW on Raw Mesothelioma Data and Normalized Data
Table 5.9. Compare ACO and WMW on Contaminated Mesothelioma Datasets - Repeated Training
Table 5.10. Compare ACO and WMW on Contaminated Mesothelioma Datasets - Bootstrap

LIST OF FIGURES

Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB.
Figure 2.2. Diagrammatic explanation of quantile normalization of training and test data.
Figure 2.3. Raw data and transformed data with different lambda using Box-Cox transformation.
Figure 3.1. Best features distribution plot for mesothelioma data set.
Figure 3.2. Plotting GA Fitness (Best and Average Values).
Figure 3.3. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.
Figure 3.4. ROC curve space.
Figure 4.1. The flow chart of ACO for feature selection.
Figure 4.2. Flow chart of moving ants for ACO.
Figure 4.3. Flow chart for solution evaluation and pheromone table updating.
Figure 4.4. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations.
Figure 4.5. Plot of ACO performance, using 25 ants to select 4 features in 1 iterations.
Figure 4.6. Plot of ACO performance, using 5 ants to select 4 features in 1 iterations.
Figure 4.7. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations.
Figure 4.8. Plot of ACO performance, using 2 ants to select 4 features in 1 iterations.
Figure 4.9. Plot for ACO stopping criteria analysis demonstrating the trend of the ACO performance with iteration increasing.
Figure 4.10. Distribution of the best AUC values in the 1 repeats of ACO function without/with stopping criteria.
Figure 4.11. Plots of patients distributions for the best features of artificial data without noise contamination.
Figure 4.12. Plots of patients distributions for the best features of artificial data with tiny contamination level.
Figure 4.13. Plots of patients distributions for the best features of artificial data with medium contamination level.
Figure 4.14. Plots of patients distributions for the best features of artificial data with high contamination level.
Figure 4.15. Histogram of selected features obtained by repeated ACO applied to artificial data without contamination.
Figure 4.16. Histogram of AUC values obtained by repeated ACO applied to artificial data without contamination.
Figure 4.17. Histogram of selected features obtained by repeated ACO applied to artificial data with tiny contamination.
Figure 4.18. Histogram of AUC values obtained by repeated ACO applied to artificial data with tiny contamination.
Figure 4.19. Histogram of selected features obtained by repeated ACO applied to artificial data with medium contamination.
Figure 4.20. Histogram of AUC values obtained by repeated ACO applied to artificial data with medium contamination.
Figure 4.21. Histogram of selected features obtained by repeated ACO applied to artificial data with heavy contamination.
Figure 4.22. Histogram of AUC values obtained by repeated ACO applied to artificial data with heavy contamination.
Figure 4.23. Repeated ACO applied to re-sampled artificial data without contamination.
Figure 4.24. Repeated ACO applied to re-sampled artificial data with tiny contamination.
Figure 4.25. Repeated ACO applied to re-sampled artificial data with medium contamination.
Figure 4.26. Repeated ACO applied to re-sampled artificial data with heavy contamination.
Figure 4.27. Repeated WMW applied to re-sampled original artificial data.
Figure 4.28. Repeated WMW applied to re-sampled artificial data with tiny contamination.
Figure 4.29. Repeated WMW applied to re-sampled artificial data with medium contamination.
Figure 4.30. Repeated WMW applied to re-sampled artificial data with heavy contamination.
Figure 4.31. Histogram of repeated ACO on original mesothelioma data.
Figure 4.32. Histogram of AUC values obtained by repeated ACO applied to subsampled mesothelioma data.
Figure 4.33. Histogram of selected features obtained by repeated ACO applied to subsampled mesothelioma data.
Figure 4.34. Histogram of AUC values obtained by repeated WMW applied to subsampled mesothelioma data.
Figure 4.35. Histogram of selected features obtained by repeated WMW applied to subsampled mesothelioma data.
Figure 5.1. The fitness progress of GAUC on mesothelioma data.
Figure 5.2. Histogram for selected features in stability experiment ACO-GLM.
Figure 5.3. Histogram for selected features in stability experiment ACO-SVM.
Figure 5.4. Histogram for selected features in stability experiment ACO-GA.
Figure 5.5. Histogram for selected features in stability experiment ACO-FLD.
Figure 5.6. Histogram for selected features in stability experiment WMW-GLM.
Figure 5.7. Histogram for selected features in stability experiment WMW-SVM.
Figure 5.8. Histogram for selected features in stability experiment WMW-GA.
Figure 5.9. Histogram for selected features in stability experiment WMW-FLD.
Figure 5.10. Histogram for selected features in stability experiment GA-GLM.
Figure 5.11. Histogram for selected features in stability experiment GA-SVM.
Figure 5.12. Histogram for selected features in stability experiment GA-GA.
Figure 5.13. Histogram for selected features in stability experiment GA-FLD.
Figure 5.14. Histogram for selected features in stability experiment FWD-GLM.
Figure 5.15. Histogram for selected features in stability experiment FWD-SVM.
Figure 5.16. Histogram for selected features in stability experiment FWD-GA.
Figure 5.17. Histogram for selected features in stability experiment FWD-FLD.
Figure 5.18. Cross validation results on contaminated mesothelioma dataset.

ACKNOWLEDGEMENTS

Dr. Marko Vuskovic has been the ideal thesis supervisor. His sage advice, patient encouragement and cogent criticism aided the writing of this thesis. I would also like to thank Dr. Joseph Lewis, whose suggestions for this study were greatly needed.

CHAPTER 1

INTRODUCTION

With growing computing power and the deployment of advanced algorithms, biomarker discovery has become an important topic in bioinformatics and computational biology, with applications such as gene and SNP selection from high-dimensional data. The stability of such selection processes with respect to sampling variation, that is, their robustness, has recently received attention. Robustness of biomarkers is an important issue because it may greatly influence subsequent biological validation. Besides feature selection, classification also plays an important role in biomarker discovery. It is usually used to evaluate performance based on the result of feature selection. A number of methods can be used in this process, including logistic regression, the Fisher linear discriminant, support vector machines and many others. Recently, Ant Colony Optimization and Genetic Algorithms have also been introduced to implement the classification.

The investigators at the Glycomic Laboratory of the NYU School of Medicine [1] are evaluating a novel means of detecting mesothelioma and lung cancer early through what could ultimately be a simple blood test. They have developed a unique cancer diagnostic approach that uses a printed glycan array (PGA). This new high-throughput platform contains 286 carbohydrate molecules (glycans) that are often expressed on the surfaces of human cells, including abnormal sugars produced by lung cancer cells in response to changes induced by the cancer process. Researchers can measure antibodies against these abnormal glycans in the blood of people with mesothelioma or lung adenocarcinoma, or of those at risk for these diseases. This test could also be a tool for identifying new therapeutic targets.

The scientists are developing this array as a global way of looking at molecules that may serve as very early markers indicating that something is wrong inside lung or mesothelium cells. This information could be used to determine whether someone is at risk for mesothelioma or lung cancer, or whether someone who already has the disease is likely to do poorly and may need more aggressive therapy.

One of the basic problems in bioinformatics is that biomarker platforms generally deal with a large number of features, ranging from hundreds to thousands. Most of these features are non-informative and ineffective in discriminating between patients in the control and case groups. Feature selection is therefore used to select the most relevant glycans, or equivalently to remove the noisy ones. Several feature selection algorithms are available. They can be divided into two groups: univariate and multivariate feature selection algorithms. Univariate methods treat the candidate features individually. The discriminatory performance of each feature is evaluated separately, all features are then ranked by their performance, and the top features are used to train the classifier. In multivariate methods, features are treated as a group of dependent variables. Many algorithms for multivariate feature selection have been developed, such as Recursive Feature Accumulation (RFA), Recursive Feature Elimination (RFE) and sequential forward/backward feature selection. These algorithms were developed as a compromise to global optimization, which becomes infeasible when the number of features is large. There are, however, heuristic algorithms for feature selection that come close to true global optimization, such as the Genetic Algorithm and Ant Colony Optimization. The latter is the focus of this thesis.

Ant Colony Optimization was initially proposed by Marco Dorigo in 1992 in his Ph.D. thesis [2]. It is a probabilistic technique for solving computational problems that can be reduced to finding a good path through a graph. By moving over a graph derived from the data model, the ants communicate with one another and share information about the goal. In this research, ACO is used to find an optimal subset of the candidate features. The goal of this thesis is to apply the ideas of ACO to the diagnosis of cancer based on data obtained from PGA.
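The univariate procedure described above (score each feature on its own, rank, keep the top ones) can be illustrated with a short sketch. It is written in Python rather than the MATLAB used in the thesis, and it assumes the per-feature score is the area under the ROC curve (AUC), computed through its equivalence with the Mann-Whitney statistic. The function names `auc_single_feature` and `rank_features` are hypothetical and are not taken from the thesis software.

```python
import numpy as np


def auc_single_feature(x, y):
    """AUC of one feature: the fraction of (case, control) pairs in which
    the case has the larger value, counting ties as one half."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    cases, controls = x[y], x[~y]
    greater = (cases[:, None] > controls[None, :]).sum()
    ties = (cases[:, None] == controls[None, :]).sum()
    return (greater + 0.5 * ties) / (len(cases) * len(controls))


def rank_features(X, y, top_k):
    """Score every column of X separately and return the indices and AUCs
    of the top_k most discriminative features.

    Assumption of this sketch: a feature with AUC near 0 separates the
    classes as well as one with AUC near 1 (with reversed direction), so
    features are ranked by |AUC - 0.5|.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array([auc_single_feature(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-np.abs(scores - 0.5), kind="stable")[:top_k]
    return order, scores[order]


# Toy data (invented for illustration): 4 patients, 3 features.
X_toy = np.array([[1.0, 1.0, 4.0],
                  [2.0, 3.0, 3.0],
                  [3.0, 2.0, 2.0],
                  [4.0, 4.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])  # 0 = control, 1 = case
top_idx, top_auc = rank_features(X_toy, y_toy, top_k=2)
```

Because each feature is scored in isolation, this kind of ranking cannot detect features that are informative only in combination, which is the gap the multivariate methods are meant to address.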
The study includes the implementation of ACO-based algorithms, the analysis of performance and tuning of algorithmic parameters, and a demonstration of the developed software on the diagnosis of mesothelioma and lung cancer. The implementation of ACO computes an important classification performance measure, the area under the Receiver Operating Characteristic curve (AUC), directly, as opposed to computing the AUC after feature selection and projection. By comparing the results of applying ACO and other feature selection methods to PGA data, this study provides a better view of this new approach to cancer detection. Although ACO does not achieve the best performance among the methods compared, it performs well with noisy data, where other algorithms fail.

The material in this thesis is organized as follows. Chapter 1 introduces the general concepts of the technologies used in the research and the organization of the thesis. Chapter 2 gives details about the PGA data for the mesothelioma study. Chapter 3 discusses other feature selection and classification approaches, including univariate and multivariate feature selection algorithms, classification models and the methods for classifier evaluation. Chapter 4 discusses the implementation of ACO, including parameter tuning and the evaluation of ACO with both artificial data and real mesothelioma data. Chapter 5 describes the experiments designed to evaluate the performance of different feature selection methods combined with different classification algorithms. Chapter 6 presents the conclusions drawn from the experiments and discusses possible future work enabled by the research in this thesis.

CHAPTER 2

MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY

2.1 MESOTHELIOMA STUDY, DEMOGRAPHICS AND GOALS

Mesothelioma, more precisely malignant mesothelioma (MM), is a rare form of cancer that develops in the mesothelium, the protective lining that covers many of the body's internal organs. It is usually caused by exposure to asbestos [3]. Its most common site is the pleura (the outer lining of the lungs and internal chest wall), but it may also occur in the peritoneum (the lining of the abdominal cavity), the heart, the pericardium (a sac that surrounds the heart) [4] or the tunica vaginalis. Most people who develop mesothelioma have worked in jobs where they inhaled asbestos and glass particles, or have been exposed to asbestos dust and fiber in other ways. Unlike lung cancer, mesothelioma has no association with smoking, but smoking greatly increases the risk of other asbestos-related cancers [5]. Those who have been exposed to asbestos often use attorneys to collect damages for asbestos-related disease, including mesothelioma. Compensation via asbestos funds or lawsuits is an important issue in mesothelioma.

The symptoms of mesothelioma include shortness of breath due to pleural effusion (fluid between the lung and the chest wall) or chest wall pain, and general symptoms such as weight loss. The diagnosis may be suspected from a chest X-ray or CT scan, and is confirmed with a biopsy (tissue sample) and microscopic examination. Diagnosing mesothelioma is often difficult because the symptoms are similar to those of a number of other conditions. Diagnosis begins with a review of the patient's medical history. A history of exposure to asbestos may increase clinical suspicion of mesothelioma. A physical examination is performed, followed by chest X-rays and often a lung function test.

The life expectancy of mesothelioma patients is generally reported as less than one year following diagnosis. However, a patient's prognosis is affected by several factors, including how early the cancer is diagnosed and how aggressively it is treated. If a problem is suspected, a physician may request several diagnostic tests. These typically include medical imaging techniques such as X-rays, CT scans, PET scans and MRI scans. A combination of these tests is often used to determine the location, size and type of cancer. Biopsy procedures are often requested following an imaging scan to test samples of fluid and tissue for the presence of cancerous cells.

In this research we demonstrate early detection and/or diagnosis of malignant mesothelioma based on Printed Glycan Arrays (PGAs). The mesothelioma study [6] includes 65 subjects exposed to asbestos but not diagnosed with MM, and 5 patients diagnosed with MM. The data were obtained from serum collected by Prof. Harvey Pass, MD, in the School of Medicine at NYU, and developed on PGAs at Cellexicon, Inc., La Jolla, CA. The data and related results were part of the NIH-NCI grant [7] and have appeared in several publications, including [8] and [6, 9]. In the following sections we describe the PGAs and their functionality, and the data preprocessing algorithms that are applied before ACO-based feature selection and classification.

2.2 PRINTED GLYCAN ARRAY

In medicine, a biomarker can be a traceable substance that is introduced into an organism as a means of examining organ function or other aspects of health. It can also be a substance whose detection indicates a particular disease state. For example, the presence of an antibody may indicate an infection. More specifically, in this research a biomarker (a glycan) indicates a change in expression or state of the immune system that correlates with the risk or progression of mesothelioma, or with the susceptibility of the disease to a given treatment. Biochemical biomarkers are often used in clinical trials, where they are derived from bodily fluids that are easily available to early-phase researchers.

2.2.1 General Information

In the last five years, a new biomarker-discovery platform based on glycan arrays has emerged [9], which has some advantages over nucleic acid-based and other platforms. Printed glycan arrays are similar to DNA microarrays, but contain deposits of various carbohydrate structures (glycans) instead of spotted DNA. Most of these glycans can be found on the surfaces of normal human cells, human cancer cells, and many human infectious agents such as bacteria, viruses and other pathogenic microorganisms.

Transformation of cells from healthy to pre-malignant and malignant is associated with the appearance of abnormal glycosylation on proteins and lipids presented on the surface of these cells. The malignancy-related abnormal glycans are called tumor-associated carbohydrate antigens (TACA). There is growing evidence that numerous TACAs are immunogenic, and that the human immune system can generate antibodies against them. Since multiple glycans arrayed on PGAs are either known TACAs or closely related structures, the antibodies in human sera that bind to glycans on PGAs can indicate the status of the immune system's response to human malignancies.

A printed glycan array (PGA) consists of a glass slide coated with a chemically reactive surface to which various glycans are covalently attached using standard amino-coupling chemistry and contact printing technology. A PGA slide contains several sub-arrays, which are identical duplicates of the entire currently available glycan library, in the form of microscopic glycan deposits about 8 microns in size. For each slide, the data from each sub-array are treated as the raw data to which the processing and classification algorithms are applied.

The advantages of a potential PGA-based serum test [9] for early detection of cancer and cancer risk can be summarized as follows: (a) minimal invasiveness of serum sampling; (b) minimal sampling variability, in contrast to the well-known heterogeneity of solid tissue samples; (c) stability of antibodies; (d) low cost of the technology; (e) low labor intensity and short duration of the test; (f) broad scope of the test, i.e. the test does not have to be narrowly targeted to a particular disease, e.g. a cancer type. All these advantages make the PGA platform attractive for early detection of disease and for potential application in screening of the general population.

Generally, five steps are needed to obtain the PGA data: printing of the glycan arrays, development of the arrays with serum samples, scanning, quantification and data aggregation. After these steps, a data structure is formed from the quantified PGA data. The details of the data structure are discussed in the next section. Because individual glycans of PGA arrays have relatively moderate discriminatory power (see Chapter 4), an ant-based feature selection algorithm is needed to find an optimal combination of several biomarkers for classification, modeling and other purposes. The results comparing the discriminatory power of individual glycans and of combinations of glycans are discussed in a later chapter.

2.2.2 Structure of Data for MATLAB

We work with the data as a structure consisting mainly of a two-dimensional matrix, two row vectors and two column vectors, plus other auxiliary data (see Figure 2.1). One of the row vectors is called the Original Column Index (OCI). It holds the original index of each feature in the matrix after data quantification and before any features are extracted from the matrix. These indices correspond to the order of the glycans in the PGA library. The second row vector contains the Glycan Identification (GID) assigned to each feature. Each GID is a unique three-digit number that denotes a specific glycan structure used in the array. One of the column vectors is the Patient Identification (PID). Each patient is assigned a unique ID for further exploration. The second column vector (Y) contains binary labels, i.e. the membership of each patient in the control or case class.

[Figure 2.1: schematic of the data structure, showing OCI (1 x d), GID (1 x d), PID (n x 1), X (n x d) and Y (n x 1).]
Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB.
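As a rough illustration of the layout in Figure 2.1, the following sketch mirrors the five components in Python with NumPy. It is not the thesis's MATLAB structure: the class name `PGAData`, its field names and the `select_features` helper are assumptions made for this example, and the toy values are invented. The point it shows is that the per-feature vectors (OCI, GID) must be subset together with the columns of X, while the per-patient vectors (PID, Y) stay aligned with its rows.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PGAData:
    """Hypothetical container mirroring the layout of Figure 2.1."""
    oci: np.ndarray  # (d,) original column index of each feature
    gid: np.ndarray  # (d,) glycan identification of each feature
    pid: np.ndarray  # (n,) patient identification
    X: np.ndarray    # (n, d) intensities, patients in rows, glycans in columns
    y: np.ndarray    # (n,) binary labels: 0 = control, 1 = case

    def __post_init__(self):
        n, d = self.X.shape
        if self.oci.shape != (d,) or self.gid.shape != (d,):
            raise ValueError("OCI and GID need one entry per column of X")
        if self.pid.shape != (n,) or self.y.shape != (n,):
            raise ValueError("PID and Y need one entry per row of X")

    def select_features(self, cols):
        """Return a copy restricted to the given columns, keeping OCI and
        GID aligned with the remaining columns of X."""
        cols = np.asarray(cols)
        return PGAData(self.oci[cols], self.gid[cols], self.pid,
                       self.X[:, cols], self.y)


# Toy values invented for illustration: 2 patients, 3 glycans.
data = PGAData(
    oci=np.array([1, 2, 3]),
    gid=np.array([101, 245, 310]),
    pid=np.array([7, 8]),
    X=np.array([[0.5, 1.2, 3.4],
                [0.7, 0.9, 2.8]]),
    y=np.array([0, 1]),
)
subset = data.select_features([2, 0])
```

Keeping OCI alongside the selected columns is what allows a feature chosen by any selection algorithm to be traced back to its position in the glycan library.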

Each element of the matrix X represents a fluorescence intensity in relative units (FRU) associated with the binding of anti-glycan antibodies from the serum of a patient (rows) to a glycan (columns). The most critical part of the above data structure is the two-dimensional matrix X. Since the samples for different patients may be collected by different physicians, with different equipment, at different locations and even in different years, the data might not be comparable. These differences could introduce biases and other unexpected effects, which would reduce the reliability of the glycans (features) and could even cause incorrect classification. An important step is therefore necessary: the preprocessing of the raw data, explained in the following section.

2.3 DATA PREPROCESSING

Real-world data are generally incomplete, noisy and inconsistent [10]. They may lack attribute values, lack certain attributes of interest, or contain errors or outliers. Sometimes the data contain discrepancies caused by differences in equipment and environment. Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure [11]. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that can be processed more easily and effectively for the user's purpose. A number of different tools and methods are used for preprocessing. The most common are normalization and transformation.

In this study, the ultimate goal is to extract the most significant information from the PGAs in order to classify patients correctly. From this point of view, every step before the classification, including PGA development, data normalization and transformation, and feature selection, can be considered preprocessing. Since the main objective of this study is to evaluate the performance of the ant-based algorithm in feature selection, we consider only normalization and transformation as preprocessing steps. In the following two sub-sections we discuss normalization, which organizes data for more efficient access, and transformation, which manipulates raw data to produce a single input.

2.3.1 Normalization

In one usage in statistics, normalization is the process of isolating statistical error in repeatedly measured data. In another usage, normalization refers to dividing multiple sets of data by a common variable in order to negate that variable's effect on the data, so that the underlying characteristics of the data sets can be compared. This brings data on different scales to a common scale. For example, in this study a PGA image is developed with the patient's serum and glycans on the glass slides, which are scanned by a laser scanner for quantification. The images for different patients can vary because of differences in equipment, location or even climate. To handle this problem, normalization is necessary to ensure that the data for the patients are comparable, so that the subsequent steps, feature selection and classification, achieve reliable results. In this study we use three normalization methods: quantile normalization, intra-slide normalization and inter-slide normalization. The three methods are applied to the raw data, and the resulting processed data are used in the feature selection procedure.

2.3.1.1 QUANTILE NORMALIZATION

The goal of quantile normalization is to ensure that the distribution of intensities across all variables (glycans) is the same for each patient [12]. The method is motivated by the idea that a quantile-quantile plot shows that two data vectors have the same distribution if the plot is a straight diagonal line, and different distributions otherwise. This concept is extended to n dimensions: if all n data vectors have the same distribution, then plotting the quantiles in n dimensions gives a straight line along the diagonal. This suggests that a set of data can be given the same distribution by projecting the points of the n-dimensional quantile plot onto the diagonal.

The critical part of this normalization is finding the reference distribution. Generally, a reference distribution is one of the standard statistical distributions, such as the Gaussian distribution or the Poisson distribution. The reference distribution can be generated randomly or by taking regular samples from the cumulative distribution function of the distribution. However, any reference distribution can be used.
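The idea can be sketched in a few lines of Python with NumPy. This is a minimal illustration and not the thesis's MATLAB implementation: the function name `quantile_normalize` is hypothetical, and when no reference distribution is supplied the sketch assumes the common choice of averaging the sorted intensities across patients, which corresponds to projecting the quantile plot onto the diagonal. Each patient's values are replaced by the reference value that has the same rank.

```python
import numpy as np


def quantile_normalize(X, reference=None):
    """Give every row (patient) of X the same distribution of values.

    X         : (n, d) array, patients in rows and glycans in columns.
    reference : optional length-d reference distribution. If omitted,
                the mean of the sorted rows is used (an assumption of
                this sketch; any reference distribution can be used).
    """
    X = np.asarray(X, dtype=float)
    if reference is None:
        # Mean of the k-th smallest value across patients, for each k.
        reference = np.sort(X, axis=1).mean(axis=0)
    else:
        reference = np.sort(np.asarray(reference, dtype=float))
    # Rank of each value within its own row (0 = smallest). Tied values
    # receive distinct consecutive ranks in this simple version.
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    # Replace every value by the reference value of the same rank.
    return reference[ranks]


# Toy intensities invented for illustration: 2 patients, 3 glycans.
X_raw = np.array([[5.0, 2.0, 3.0],
                  [4.0, 1.0, 9.0]])
X_norm = quantile_normalize(X_raw)
```

After normalization every row contains exactly the same set of values, and only their order, which reflects the ranking of the glycans within each patient, differs between patients.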


More information

Data Mining and Machine Learning in Bioinformatics

Data Mining and Machine Learning in Bioinformatics Data Mining and Machine Learning in Bioinformatics PRINCIPAL METHODS AND SUCCESSFUL APPLICATIONS Ruben Armañanzas http://mason.gmu.edu/~rarmanan Adapted from Iñaki Inza slides http://www.sc.ehu.es/isg

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening

Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening , pp.169-178 http://dx.doi.org/10.14257/ijbsbt.2014.6.2.17 Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening Ki-Seok Cheong 2,3, Hye-Jeong Song 1,3, Chan-Young Park 1,3, Jong-Dae

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Mesothelioma. 1995-2013, The Patient Education Institute, Inc. www.x-plain.com ocft0101 Last reviewed: 03/21/2013 1

Mesothelioma. 1995-2013, The Patient Education Institute, Inc. www.x-plain.com ocft0101 Last reviewed: 03/21/2013 1 Mesothelioma Introduction Mesothelioma is a type of cancer. It starts in the tissue that lines your lungs, stomach, heart, and other organs. This tissue is called mesothelium. Most people who get this

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen

CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.

Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study. Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study Prepared by: Centers for Disease Control and Prevention National

More information

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India Volume 5, Issue 6, June 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Multiple Pheromone

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

Disease/Illness GUIDE TO ASBESTOS LUNG CANCER. What Is Asbestos Lung Cancer? www.simpsonmillar.co.uk Telephone 0844 858 3200

Disease/Illness GUIDE TO ASBESTOS LUNG CANCER. What Is Asbestos Lung Cancer? www.simpsonmillar.co.uk Telephone 0844 858 3200 GUIDE TO ASBESTOS LUNG CANCER What Is Asbestos Lung Cancer? Like tobacco smoking, exposure to asbestos can result in the development of lung cancer. Similarly, the risk of developing asbestos induced lung

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Rulex s Logic Learning Machines successfully meet biomedical challenges.

Rulex s Logic Learning Machines successfully meet biomedical challenges. Rulex s Logic Learning Machines successfully meet biomedical challenges. Rulex is a predictive analytics platform able to manage and to analyze big amounts of heterogeneous data. With Rulex, it is possible,

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Identifying SPAM with Predictive Models

Identifying SPAM with Predictive Models Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Data Mining Analysis of HIV-1 Protease Crystal Structures

Data Mining Analysis of HIV-1 Protease Crystal Structures Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko, A. Srinivas Reddy, Sunil Kumar, and Rajni Garg AP0907 09 Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko 1, A.

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

A Review of Data Mining Techniques

A Review of Data Mining Techniques Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Exploring the Role of Vitamins in Achieving a Healthy Heart

Exploring the Role of Vitamins in Achieving a Healthy Heart Exploring the Role of Vitamins in Achieving a Healthy Heart There are many avenues you can take to keep your heart healthy. The first step you should take is to have a medical professional evaluate the

More information

JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm

More information

FREQUENTLY ASKED QUESTIONS about asbestos related diseases

FREQUENTLY ASKED QUESTIONS about asbestos related diseases FREQUENTLY ASKED QUESTIONS about asbestos related diseases 1. What are the main types of asbestos lung disease? In the human body, asbestos affects the lungs most of all. It can affect both the spongy

More information

Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

More information

Tutorial 5: Hypothesis Testing

Tutorial 5: Hypothesis Testing Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information