APPLICATION OF POPULATION-BASED TECHNOLOGY IN SELECTION OF GLYCAN MARKERS FOR CANCER DETECTION

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Haofei Fang

Summer 2012
Copyright 2012 by Haofei Fang. All Rights Reserved.
DEDICATION

This thesis is dedicated to my dear fiancée, who supported me through moments of perplexity during the thesis preparation. She is one of the great forces pushing me forward. It is also dedicated to my parents, who taught me how to face difficulties. Their support encourages me to find my way to success.
ABSTRACT OF THE THESIS

Application of Population-Based Technology in Selection of Glycan Markers for Cancer Detection
by Haofei Fang
Master of Science in Computer Science
San Diego State University, 2012

Recent advances in computer technology and in molecular biology have greatly influenced and promoted the field of bioinformatics. Among these advances are new high-throughput platforms for biomarker discovery and new algorithms for feature selection and classification. This thesis is dedicated to a class of feature selection and classification algorithms based on a new paradigm of artificial intelligence and pattern recognition known as swarm intelligence. The particular algorithm considered is Ant Colony Optimization (ACO), which is applied to a recently emerged biomarker platform based on printed glycan arrays (PGA). The thesis proposes an implementation of ACO that is specially tuned for the diagnosis of cancer using PGA data. The implementation is evaluated on real clinical data obtained from the School of Medicine of NYU, which contain 65 control samples of high-risk subjects exposed to asbestos and 5 subjects diagnosed with malignant mesothelioma. The results are compared to those obtained on artificially generated data whose general characteristics are similar to the original real data.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER 1 INTRODUCTION

CHAPTER 2 MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY
  Mesothelioma Study, Demographics and Goals
  Printed Glycan Array
    General Information
    Structure of Data for MATLAB
  Data Preprocessing
    Normalization
      Quantile Normalization
      Intra-Slide Normalization
      Inter-Slide Normalization
    Transformation
      Transformation Necessity
      Implementation and Result

CHAPTER 3 FEATURE SELECTION AND CLASSIFICATION
  Univariate Feature Selection
  Multivariate Feature Selection
    Forward Sequential Feature Selection (FWD)
    Recursive Feature Elimination (RFE)
    Genetic Algorithm (GA)
    Ant Colony Optimization (ACO)
    Classification and Regression Trees (C&RT/C4.5)
    Random Forest Trees (RF)
  Classifiers
    Multiple Logistic Regression (MLR)
    Generalized Linear Model (GLM)
    Linear Discriminant Analysis (LDA)
    Support Vector Machines (SVM)
    Naive Bayes/Mahalanobis Distance
    K-Nearest Neighbor (KNN)
  Classifier Performance Measures
    Accuracy
    Area Under the ROC Curve (AUC)
  Cross Validation
    Leave-One-Out Cross Validation (LOOCV)
    K-fold Cross Validation
    Hold-Out Cross Validation

CHAPTER 4 ANT COLONY OPTIMIZATION ALGORITHM
  Theory and the Algorithm
  Implementation
    Optimization Objective
    M-Files
    Step 1: Initialization
    Step 2: Population
    Step 3: Evaluation
    Step 4: Deposition
    Step 5: Preparation for New Iteration
    Optional Step: Randomization
  Empirical Tuning of ACO Parameters
    Number of Ants
    Stopping Criteria
  Application of ACO to Artificial Data
    Generation of Artificial Data
    Contaminated Artificial Data
    Results
  Application of ACO to Mesothelioma Study
    Experiment Design
    Results

CHAPTER 5 COMPARISON OF ACO WITH OTHER APPROACHES IN CLASSIFICATION
  Genetic AUC Optimizer (GAUC)
  Experiment Design
    Efficiency Test
    Stability Test
    Cross-Validation Performance on Raw and Contaminated Mesothelioma Data
  Results
    Efficiency Test
    Stability Test
    Cross-Validation Performance on Raw and Contaminated Mesothelioma Data

CHAPTER 6 CONCLUSION

REFERENCES
LIST OF TABLES

Table 3.1. WMW Rank Calculation Demonstration
Table 3.2. Result of Applying WMW to Mesothelioma Data Set
Table 4.1. Parameters to be Initialized for ACO
Table 4.2. Summary of Number of Ants Tuning (m = 4)
Table 4.3. Summary of Number of Ants Tuning Verification (m = 6)
Table 4.4. OCI-GID Reference Table
Table 4.5. Contamination Parameters for Artificial Data
Table 4.6. Selected Values of Parameters for Contamination
Table 4.7. The Performance of WMW on Artificial Data without/with Contamination
Table 4.8. Performance of WMW and ACO on Artificial Data
Table 4.9. Performance of WMW and ACO Applied on Artificial Data: Bootstrap
Table 4.10. Result of Comparing ACO Repeats and WMW on Mesothelioma Data Set
Table 4.11. Result of Comparing ACO Repeats and WMW on Subsampled Mesothelioma Data Set
Table 5.1. Execution Time for Each Combination
Table 5.2. Average AUC Values for Each Combination
Table 5.3. Results of Stability Experiment
Table 5.4. Cross Validation on Mesothelioma Data: Best AUC Values
Table 5.5. Cross Validation on Mesothelioma Data: Features at Best AUC Values
Table 5.6. Cross Validation on Mesothelioma Data: Best Stability
Table 5.7. Cross Validation on Mesothelioma Data: Features at Best Stability
Table 5.8. Compare ACO and WMW on Raw Mesothelioma Data and Normalized Data
Table 5.9. Compare ACO and WMW on Contaminated Mesothelioma Datasets: Repeated Training
Table 5.10. Compare ACO and WMW on Contaminated Mesothelioma Datasets: Bootstrap
LIST OF FIGURES

Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB.
Figure 2.2. Diagrammatic explanation of quantile normalization of training and test data.
Figure 2.3. Raw data and transformed data with different lambda using Box-Cox transformation.
Figure 3.1. Best features distribution plot for mesothelioma data set.
Figure 3.2. Plotting GA fitness (best and average values).
Figure 3.3. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.
Figure 3.4. ROC curve space.
Figure 4.1. The flow chart of ACO for feature selection.
Figure 4.2. Flow chart of moving ants for ACO.
Figure 4.3. Flow chart for solution evaluation and pheromone table updating.
Figure 4.4. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations.
Figure 4.5. Plot of ACO performance, using 25 ants to select 4 features in 1 iterations.
Figure 4.6. Plot of ACO performance, using 5 ants to select 4 features in 1 iterations.
Figure 4.7. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations.
Figure 4.8. Plot of ACO performance, using 2 ants to select 4 features in 1 iterations.
Figure 4.9. Plot for ACO stopping criteria analysis demonstrating the trend of the ACO performance with increasing iterations.
Figure 4.10. Distribution of the best AUC values in the 1 repeats of ACO function without/with stopping criteria.
Figure 4.11. Plots of patients distributions for the best features of artificial data without noise contamination.
Figure 4.12. Plots of patients distributions for the best features of artificial data with tiny contamination level.
Figure 4.13. Plots of patients distributions for the best features of artificial data with medium contamination level.
Figure 4.14. Plots of patients distributions for the best features of artificial data with high contamination level.
Figure 4.15. Histogram of selected features obtained by repeated ACO applied to artificial data without contamination.
Figure 4.16. Histogram of AUC values obtained by repeated ACO applied to artificial data without contamination.
Figure 4.17. Histogram of selected features obtained by repeated ACO applied to artificial data with tiny contamination.
Figure 4.18. Histogram of AUC values obtained by repeated ACO applied to artificial data with tiny contamination.
Figure 4.19. Histogram of selected features obtained by repeated ACO applied to artificial data with medium contamination.
Figure 4.20. Histogram of AUC values obtained by repeated ACO applied to artificial data with medium contamination.
Figure 4.21. Histogram of selected features obtained by repeated ACO applied to artificial data with heavy contamination.
Figure 4.22. Histogram of AUC values obtained by repeated ACO applied to artificial data with heavy contamination.
Figure 4.23. Repeated ACO applied to re-sampled artificial data without contamination.
Figure 4.24. Repeated ACO applied to re-sampled artificial data with tiny contamination.
Figure 4.25. Repeated ACO applied to re-sampled artificial data with medium contamination.
Figure 4.26. Repeated ACO applied to re-sampled artificial data with heavy contamination.
Figure 4.27. Repeated WMW applied to re-sampled original artificial data.
Figure 4.28. Repeated WMW applied to re-sampled artificial data with tiny contamination.
Figure 4.29. Repeated WMW applied to re-sampled artificial data with medium contamination.
Figure 4.30. Repeated WMW applied to re-sampled artificial data with heavy contamination.
Figure 4.31. Histogram of repeated ACO on original mesothelioma data.
Figure 4.32. Histogram of AUC values obtained by repeated ACO applied to subsampled mesothelioma data.
Figure 4.33. Histogram of selected features obtained by repeated ACO applied to subsampled mesothelioma data.
Figure 4.34. Histogram of AUC values obtained by repeated WMW applied to subsampled mesothelioma data.
Figure 4.35. Histogram of selected features obtained by repeated WMW applied to subsampled mesothelioma data.
Figure 5.1. The fitness progress of GAUC on mesothelioma data.
Figure 5.2. Histogram for selected features in stability experiment: ACO-GLM.
Figure 5.3. Histogram for selected features in stability experiment: ACO-SVM.
Figure 5.4. Histogram for selected features in stability experiment: ACO-GA.
Figure 5.5. Histogram for selected features in stability experiment: ACO-FLD.
Figure 5.6. Histogram for selected features in stability experiment: WMW-GLM.
Figure 5.7. Histogram for selected features in stability experiment: WMW-SVM.
Figure 5.8. Histogram for selected features in stability experiment: WMW-GA.
Figure 5.9. Histogram for selected features in stability experiment: WMW-FLD.
Figure 5.10. Histogram for selected features in stability experiment: GA-GLM.
Figure 5.11. Histogram for selected features in stability experiment: GA-SVM.
Figure 5.12. Histogram for selected features in stability experiment: GA-GA.
Figure 5.13. Histogram for selected features in stability experiment: GA-FLD.
Figure 5.14. Histogram for selected features in stability experiment: FWD-GLM.
Figure 5.15. Histogram for selected features in stability experiment: FWD-SVM.
Figure 5.16. Histogram for selected features in stability experiment: FWD-GA.
Figure 5.17. Histogram for selected features in stability experiment: FWD-FLD.
Figure 5.18. Cross validation results on contaminated mesothelioma dataset.
ACKNOWLEDGEMENTS

Dr. Marko Vuskovic has been the ideal thesis supervisor. His sage advice, patient encouragement, and cogent criticism aided the writing of this thesis. I would also like to thank Dr. Joseph Lewis, whose suggestions were of great help to this study.
CHAPTER 1 INTRODUCTION

With the growth of computing capabilities and the deployment of advanced algorithms, biomarker discovery is becoming an important topic in bioinformatics and computational biology, including applications such as gene and SNP selection from high-dimensional data. The stability of such selection processes with respect to sampling variation, that is, their robustness, has received attention recently. Robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validation. Besides feature selection, classification also plays an important role in biomarker discovery. It is usually used to evaluate performance based on the result of feature selection. A number of methods can be involved in this process, including logistic regression, Fisher linear discriminant, support vector machines and many others. Recently, Ant Colony Optimization and Genetic Algorithms have also been introduced to implement the classification.

The investigators at the Glycomic Laboratory of the NYU School of Medicine [1] are evaluating a novel means of detecting mesothelioma and lung cancer early through what could ultimately be a simple blood test. They have developed a unique cancer diagnostic approach that utilizes a printed glycan array (PGA). This new high-throughput platform contains 286 carbohydrate molecules (glycans) that are often expressed on the surfaces of human cells, including abnormal sugars produced by lung cancer cells in response to changes induced by the cancer process. Researchers can measure antibodies against these abnormal glycans in the blood of people with mesothelioma or lung adenocarcinoma, or of those at risk for these diseases. This test could also be a tool for identifying new therapeutic targets.

The scientists are developing this array as a global way of looking at molecules that may serve as very early markers indicating that something is wrong inside lung or mesothelium cells. This information could be used to determine whether someone is at risk for mesothelioma or lung cancer, or whether someone who already has the disease is likely to do poorly and may need more aggressive therapy.
One of the basic problems in bioinformatics is that biomarker platforms generally deal with a large number of features, ranging from hundreds to thousands. Most of them are non-informative and ineffective in discriminating patients between the control and case groups. Feature selection is therefore used to select the most relevant glycans, or equivalently to remove the noisy ones. Several feature selection algorithms are available. They can be divided into two groups: univariate and multivariate feature selection algorithms. Univariate methods treat candidate features individually. The discriminatory performance of each feature is evaluated separately, all features are then ranked by their performance, and the top features are used to train the classifier. In multivariate methods, features are treated as a group of dependent variables. Many algorithms for multivariate feature selection have been developed, such as Recursive Feature Accumulation (RFA), Recursive Feature Elimination (RFE) and sequential forward/backward feature selection. These algorithms are a compromise with respect to global optimization, which becomes infeasible when the number of features is large. There are, however, heuristic algorithms for feature selection that come close to true global optimization, such as the Genetic Algorithm and Ant Colony Optimization. The latter is the focus of this thesis.

Ant Colony Optimization was initially proposed by Marco Dorigo in 1992 in his Ph.D. thesis [2]. It is a probabilistic technique for solving computational problems that can be reduced to finding a good path through a graph. While moving on a map derived from the data model, ants communicate with each other to transfer information about the goal. In this research, ACO is used to find an optimal subset of the candidate features. The goal of this thesis is to apply the ideas of ACO to the diagnosis of cancer based on data obtained from PGA.
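The univariate scheme described above (score each feature alone, rank, keep the top ones) can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB used in the thesis. It assumes the per-feature score is the AUC computed from the Wilcoxon-Mann-Whitney (WMW) statistic, which the thesis names as its univariate method, and the function names `auc_single_feature` and `rank_features` are hypothetical.

```python
import numpy as np

def auc_single_feature(x, y):
    """AUC of one feature via the Wilcoxon-Mann-Whitney statistic.

    x: 1-D array of feature values, y: binary labels (1 = case, 0 = control).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    cases, controls = x[y == 1], x[y == 0]
    # Compare every case with every control; a tie counts as half a win.
    diff = cases[:, None] - controls[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(cases) * len(controls))

def rank_features(X, y, top_k):
    """Score each column of X on its own and return the top_k column indices."""
    X = np.asarray(X, dtype=float)
    aucs = np.array([auc_single_feature(X[:, j], y) for j in range(X.shape[1])])
    # An AUC below 0.5 means the feature discriminates in the opposite direction.
    scores = np.maximum(aucs, 1.0 - aucs)
    order = np.argsort(-scores, kind="stable")
    return order[:top_k], scores[order[:top_k]]

# Toy data (made up for illustration): 4 patients, 3 features.
X_toy = np.array([[1, 5, 4],
                  [2, 4, 3],
                  [3, 6, 2],
                  [4, 3, 1]])
y_toy = np.array([0, 0, 1, 1])
top_idx, top_scores = rank_features(X_toy, y_toy, top_k=2)
```

Because each column is scored in isolation, this ranking cannot see features that are only informative in combination, which is the gap the multivariate and heuristic methods are meant to close.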
The study includes the implementation of ACO-based algorithms, the analysis of performance and tuning of algorithmic parameters, and a demonstration of the developed software on the diagnosis of mesothelioma and lung cancer. The implementation of ACO computes an important classification performance measure, the area under the Receiver Operating Characteristic curve (AUC), directly, as opposed to computing the AUC after feature selection and projection. By comparing the results of applying ACO and other feature selection methods to PGA data, this study provides a better view of this new approach to cancer detection. Although ACO does not achieve the best performance among the methods compared, it performs well with noisy data, where other algorithms fail.

The material in this thesis is organized as follows. Chapter 1 introduces the general concepts of the technologies used in the research and the organization of the thesis. Chapter 2 gives details about the PGA data for the mesothelioma study. Chapter 3 discusses other feature selection and classification approaches, including univariate and multivariate feature selection algorithms, classification models and the methods for classifier evaluation. Chapter 4 discusses the implementation of ACO, including parameter tuning and the evaluation of ACO on both artificial data and real mesothelioma data. Chapter 5 describes the experiments designed to evaluate the performance of different feature selection methods combined with different classification algorithms. Chapter 6 presents the conclusions from the experiments and discusses possible future work enabled by the research in this thesis.
CHAPTER 2 MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY

2.1 MESOTHELIOMA STUDY, DEMOGRAPHICS AND GOALS

Mesothelioma, more precisely malignant mesothelioma (MM), is a rare form of cancer that develops in the mesothelium, the protective lining that covers many of the body's internal organs. It is usually caused by exposure to asbestos [3]. Its most common site is the pleura (outer lining of the lungs and internal chest wall), but it may also occur in the peritoneum (the lining of the abdominal cavity), the heart, the pericardium (a sac that surrounds the heart) [4] or the tunica vaginalis. Most people who develop mesothelioma have worked in jobs where they inhaled asbestos and glass particles, or they have been exposed to asbestos dust and fiber in other ways. Unlike lung cancer, mesothelioma has no association with smoking, but smoking greatly increases the risk of other asbestos-related cancers [5]. Those who have been exposed to asbestos often engage attorneys to collect damages for asbestos-related disease, including mesothelioma. Compensation via asbestos funds or lawsuits is an important issue in mesothelioma.

The symptoms of mesothelioma include shortness of breath due to pleural effusion (fluid between the lung and the chest wall) or chest wall pain, and general symptoms such as weight loss. The diagnosis may be suspected from a chest X-ray or CT scan, and is confirmed with a biopsy (tissue sample) and microscopic examination. Diagnosing mesothelioma is often difficult because the symptoms are similar to those of a number of other conditions. Diagnosis begins with a review of the patient's medical history. A history of exposure to asbestos may increase clinical suspicion of mesothelioma. A physical examination is performed, followed by chest X-rays and often lung function tests.
The life expectancy of mesothelioma patients is generally reported as less than one year following diagnosis. However, a patient's prognosis is affected by several factors, including how early the cancer is diagnosed and how aggressively it is treated. If a problem is suspected, a physician may request several diagnostic tests. These typically include medical imaging techniques such as X-rays, CT scans, PET scans and MRI scans. A combination of these tests is often used to determine the location, size and type of cancer. Biopsy procedures are often requested following an imaging scan to test samples of fluid and tissue for the presence of cancerous cells.

In this research we demonstrate early detection and/or diagnosis of malignant mesothelioma based on Printed Glycan Arrays (PGAs). The mesothelioma study [6] includes 65 subjects exposed to asbestos but not diagnosed with MM, and 5 patients diagnosed with MM. The data were obtained from serum collected by Prof. Harvey Pass, MD, in the School of Medicine at NYU, and developed on PGAs at Cellexicon, Inc., La Jolla, CA. The data and related results were part of the NIH-NCI grant [7] and appear in several publications, including [8] and [6, 9]. In the following sections we describe the PGAs and their functionality, and the data preprocessing algorithms that are applied before ACO-based feature selection and classification.

2.2 PRINTED GLYCAN ARRAY

In medicine, a biomarker can be a traceable substance that is introduced into an organism as a means to examine organ function or other aspects of health. It can also be a substance whose detection indicates a particular disease state. For example, the presence of an antibody may indicate an infection. More specifically, in this research a biomarker (a glycan) indicates a change in expression or state of the immune system that correlates with the risk or progression of mesothelioma, or with the susceptibility of the disease to a given treatment.
Biochemical biomarkers are often used in clinical trials, where they are derived from bodily fluids that are easily available to the early phase researchers.
2.2.1 General Information

In the last five years, a new biomarker-discovery platform based on glycan arrays has emerged [9], which has some advantages over nucleic acid-based and other platforms. Printed glycan arrays are similar to DNA microarrays, but contain deposits of various carbohydrate structures (glycans) instead of spotted DNA. Most of these glycans can be found on the surfaces of normal human cells, human cancer cells, and many human infectious agents such as bacteria, viruses, and other pathogenic microorganisms. The transformation of cells from healthy to pre-malignant and malignant is associated with the appearance of abnormal glycosylation on proteins and lipids presented on the surface of these cells. The malignancy-related abnormal glycans are called tumor-associated carbohydrate antigens (TACA). There is growing evidence that numerous TACAs are immunogenic, and that the human immune system can generate antibodies against them. Since many glycans arrayed on PGAs are either known TACAs or closely related structures, the antibodies in human sera that bind to glycans on PGAs can indicate the status of the immune system's response to human malignancies.

A printed glycan array (PGA) consists of a glass slide coated with a chemically reactive surface on which various glycans are covalently attached using standard amino-coupling chemistry and contact printing technology. A PGA slide contains several sub-arrays, identical duplicates of the entire currently available glycan library, in the form of microscopic glycan deposits about 8 microns in size. For each slide, the data from each sub-array are treated as the raw data to which the processing and classification algorithms are applied.
The advantages of a potential PGA-based serum test [9] for early detection of cancer and cancer risk can be summarized as follows: (a) minimal invasiveness of serum sampling; (b) minimal sampling variability, in contrast to the well-known heterogeneity of solid tissue samples; (c) stability of antibodies; (d) low cost associated with the technology; (e) low labor intensity and short duration of the test; (f) broad scope of the test, i.e. the test does not have to be narrowly targeted to a particular disease, e.g. a cancer type. All these advantages make the PGA platform attractive for early detection of disease and for potential application in screening of the general population.
Generally, five steps are needed to obtain the PGA data: printing of glycan arrays, development of arrays with serum samples, scanning, quantification and data aggregation. After these steps, a data structure is formed from the quantified PGA data. The details of the data structure are discussed in the next section. Because individual glycans on PGAs have only moderate discriminatory power (see Chapter 4), an ant-based feature selection algorithm is needed to find an optimal combination of several biomarkers for classification, modeling and other purposes. The results comparing the discriminatory power of individual glycans and of combinations of glycans are discussed in a later chapter.

2.2.2 Structure of Data for MATLAB

We work with the data as a structure consisting mainly of a two-dimensional matrix, two row vectors and two column vectors, plus other auxiliary data (see Figure 2.1). One of the row vectors is called the Original Column Index (OCI). It holds the original index of each feature in the matrix after data quantification and before any features are extracted from the matrix. These indices correspond to the order of glycans in the PGA library. The second row vector contains the Glycan Identification (GID) assigned to each feature. Each GID is a unique three-digit number that denotes a specific glycan structure used in the array. One of the column vectors is the Patient Identification (PID). Each patient is assigned a unique ID for further exploration. The second column vector (Y) contains binary labels, i.e. membership in the control or case class for each patient.

Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB. (Fields shown: OCI, 1×d; GID, 1×d; PID, n×1; X, n×d; Y, n×1.)
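The layout in Figure 2.1 can be mirrored in a small container type. The sketch below is written in Python rather than the MATLAB struct the thesis uses; the class name `PGAData`, its method `select_features` and the toy values are all hypothetical, and only the field names and shapes come from the text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PGAData:
    """Hypothetical container following the fields of Figure 2.1."""
    oci: np.ndarray  # shape (d,): original column index of each feature
    gid: np.ndarray  # shape (d,): glycan identification numbers
    pid: np.ndarray  # shape (n,): patient identification numbers
    X: np.ndarray    # shape (n, d): intensities, patients by glycans
    y: np.ndarray    # shape (n,): binary labels (assumed 0 = control, 1 = case)

    def __post_init__(self):
        n, d = self.X.shape
        if self.oci.shape != (d,) or self.gid.shape != (d,):
            raise ValueError("oci and gid must have one entry per column of X")
        if self.pid.shape != (n,) or self.y.shape != (n,):
            raise ValueError("pid and y must have one entry per row of X")

    def select_features(self, cols):
        """Keep only the given columns; OCI and GID stay aligned with X."""
        cols = np.asarray(cols)
        return PGAData(self.oci[cols], self.gid[cols], self.pid,
                       self.X[:, cols], self.y)

# Toy instance with made-up values: 3 patients, 4 glycans.
toy = PGAData(
    oci=np.arange(1, 5),
    gid=np.array([101, 102, 103, 104]),
    pid=np.array([1, 2, 3]),
    X=np.arange(12.0).reshape(3, 4),
    y=np.array([0, 1, 0]),
)
subset = toy.select_features([0, 2])
```

Carrying OCI alongside X is what allows a selected feature to be traced back to its position in the glycan library after columns have been dropped.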
Each element of the matrix X represents a fluorescence intensity in relative units (FRU) associated with the binding of anti-glycan antibodies from the serum of a patient (rows) to a glycan (columns). The most critical part of the above data structure is the two-dimensional matrix X. Since the sample data for different patients could be collected by different physicians with different equipment at different locations, and even in different years, the data might not be comparable. These differences could introduce biases and other unexpected effects on the data, which would reduce the reliability of the glycans (features) and even cause incorrect classification. Preprocessing of the raw data is therefore a necessary step, explained in the following section.

2.3 DATA PREPROCESSING

Real-world data are generally incomplete, noisy and inconsistent [10]. They may lack attribute values, lack certain attributes of interest, or contain errors and outliers. Sometimes the data contain discrepancies caused by the variety of equipment and environments. Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure [11]. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that can be processed more easily and effectively for the purpose of the user. A number of different tools and methods are used for preprocessing. The most common are normalization and transformation. In this study, the ultimate goal is to extract the most significant information from the PGAs in order to classify patients correctly. From this point of view, every step before classification, including PGA development, data normalization and transformation as well as feature selection, can be considered preprocessing.
Since the main objective of this study is to evaluate the performance of the ant-based algorithm in feature selection, we consider only normalization and transformation as preprocessing steps. In the following two sub-sections we discuss normalization, which organizes data for more efficient access, and transformation, which manipulates raw data to produce a single input.
2.3.1 Normalization

In one usage in statistics, normalization is the process of isolating statistical error in repeated measured data. In another usage, normalization refers to the division of multiple sets of data by a common variable in order to negate that variable's effect on the data, thus allowing underlying characteristics of the data sets to be compared. This allows data on different scales to be compared by bringing them to a common scale. For example, in this study a PGA image is developed with a patient's serum and glycans on the glass slides, which are scanned by a laser scanner for quantification. The images for different patients can vary because of equipment, location or even climate differences. To handle this problem, normalization is necessary to ensure that the data for different patients are comparable, so that the following steps, feature selection and classification, achieve reliable results. In this study we use three normalization methods: quantile normalization, intra-slide normalization and inter-slide normalization. These three methods are applied to the raw data, and the resulting processed data are used in the subsequent feature selection procedure.

2.3.1.1 QUANTILE NORMALIZATION

The goal of quantile normalization is to ensure that the distribution of intensities across all variables (glycans) is the same for each patient [12]. The method is motivated by the idea that a quantile-quantile plot of two data vectors is a straight diagonal line if the two vectors have the same distribution, and deviates from the diagonal otherwise. This concept extends to n dimensions: if all n data vectors have the same distribution, then plotting the quantiles in n dimensions gives a straight line along the diagonal. This suggests that a set of data can be given the same distribution by projecting the points of the n-dimensional quantile plot onto the diagonal.
The critical part of this normalization is to find the reference distribution. Generally, a reference distribution is one of the standard statistical distributions, such as the Gaussian distribution or the Poisson distribution. The reference distribution can be generated randomly or by taking regular samples from the cumulative distribution function of the distribution. However, any reference distribution can be used.
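The projection onto the diagonal can be sketched in a few lines. The example below is in Python rather than the thesis's MATLAB and makes one assumption that the text does not state: when no reference is supplied, the reference distribution is taken as the mean of the sorted intensities across patients, which is the common choice in microarray practice. The function name `quantile_normalize` is hypothetical.

```python
import numpy as np

def quantile_normalize(X, reference=None):
    """Give every row (patient) of X the same empirical distribution.

    X: (n, d) array, patients by glycans.
    reference: optional length-d reference distribution. If omitted, the mean
    of the row-wise sorted values is used (an assumption of this sketch).
    Returns the normalized matrix and the sorted reference that was used.
    """
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = np.sort(X, axis=1).mean(axis=0)
    else:
        reference = np.sort(np.asarray(reference, dtype=float))
    # Rank of each value within its own row (0 = smallest); ties are broken
    # by column position.
    order = np.argsort(X, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    # Replace each value with the reference quantile of the same rank.
    return reference[ranks], reference

# Toy data (made up): 2 patients, 3 glycans.
X_raw = np.array([[5.0, 2.0, 3.0],
                  [4.0, 1.0, 6.0]])
X_norm, ref = quantile_normalize(X_raw)
```

After the call, every row of `X_norm` contains exactly the values of `ref`, arranged in the rank order of the original row. Passing a reference computed on training data when normalizing test data, as in `quantile_normalize(X_test, reference=ref)`, keeps the two sets on the same distribution without letting the test samples influence the reference.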
More informationService courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationStatistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationA Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
More informationBiomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening
, pp.169-178 http://dx.doi.org/10.14257/ijbsbt.2014.6.2.17 Biomarker Discovery and Data Visualization Tool for Ovarian Cancer Screening Ki-Seok Cheong 2,3, Hye-Jeong Song 1,3, Chan-Young Park 1,3, Jong-Dae
More informationEFFICIENT DATA PRE-PROCESSING FOR DATA MINING
EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationMesothelioma. 1995-2013, The Patient Education Institute, Inc. www.x-plain.com ocft0101 Last reviewed: 03/21/2013 1
Mesothelioma Introduction Mesothelioma is a type of cancer. It starts in the tissue that lines your lungs, stomach, heart, and other organs. This tissue is called mesothelium. Most people who get this
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationSTATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationEvaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationWebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationCONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationClassification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationA Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
More informationCS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
More information!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
More informationA Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationApplied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationAppendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP. Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study.
Appendix G STATISTICAL METHODS INFECTIOUS METHODS STATISTICAL ROADMAP Prepared in Support of: CDC/NCEH Cross Sectional Assessment Study Prepared by: Centers for Disease Control and Prevention National
More informationManjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India
Volume 5, Issue 6, June 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Multiple Pheromone
More informationCross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
More informationDisease/Illness GUIDE TO ASBESTOS LUNG CANCER. What Is Asbestos Lung Cancer? www.simpsonmillar.co.uk Telephone 0844 858 3200
GUIDE TO ASBESTOS LUNG CANCER What Is Asbestos Lung Cancer? Like tobacco smoking, exposure to asbestos can result in the development of lung cancer. Similarly, the risk of developing asbestos induced lung
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationMHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
More informationRulex s Logic Learning Machines successfully meet biomedical challenges.
Rulex s Logic Learning Machines successfully meet biomedical challenges. Rulex is a predictive analytics platform able to manage and to analyze big amounts of heterogeneous data. With Rulex, it is possible,
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
More informationClustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationChapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
More informationIdentifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationData Mining Analysis of HIV-1 Protease Crystal Structures
Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko, A. Srinivas Reddy, Sunil Kumar, and Rajni Garg AP0907 09 Data Mining Analysis of HIV-1 Protease Crystal Structures Gene M. Ko 1, A.
More informationData Exploration Data Visualization
Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select
More informationE-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC
More informationImproving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
More informationbusiness statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationS03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationA Review of Data Mining Techniques
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,
More informationData Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationStatistics Review PSY379
Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses
More informationExploring the Role of Vitamins in Achieving a Healthy Heart
Exploring the Role of Vitamins in Achieving a Healthy Heart There are many avenues you can take to keep your heart healthy. The first step you should take is to have a medical professional evaluate the
More informationJetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
More informationPredicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &
More informationPredictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com Louise.francis@data-mines.cm
More informationFREQUENTLY ASKED QUESTIONS about asbestos related diseases
FREQUENTLY ASKED QUESTIONS about asbestos related diseases 1. What are the main types of asbestos lung disease? In the human body, asbestos affects the lungs most of all. It can affect both the spongy
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationTutorial 5: Hypothesis Testing
Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................
More informationX X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
More information