APPLICATION OF POPULATION-BASED TECHNOLOGY IN SELECTION OF GLYCAN MARKERS FOR CANCER DETECTION

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

by Haofei Fang

Summer 2012
Copyright 2012 by Haofei Fang. All Rights Reserved.
DEDICATION

This thesis is dedicated to my dear fiancée, who supported me when I was in perplexity during the thesis preparation. She is one of the great powers pushing me to go forward. It is also dedicated to my parents, who taught me how to face difficulties. Their support encourages me to find my way to success.
ABSTRACT OF THE THESIS

Application of Population-Based Technology in Selection of Glycan Markers for Cancer Detection
by Haofei Fang
Master of Science in Computer Science
San Diego State University, 2012

Recent advances in computer technology and in molecular biology have greatly influenced and promoted the field of bioinformatics. Among these advances are new high-throughput platforms for biomarker discovery and new algorithms for feature selection and classification. This thesis is dedicated to a class of feature selection and classification algorithms that are based on a new paradigm of artificial intelligence and pattern recognition known as swarm intelligence. The particular algorithm considered is Ant Colony Optimization (ACO), which is applied to a recently emerged biomarker platform based on printed glycan arrays (PGA). The thesis proposes an implementation of ACO that is specially tuned for the diagnosis of cancer using PGA data. The implementation is evaluated on real clinical data obtained from the School of Medicine of NYU, which contain 65 control samples of high-risk subjects exposed to asbestos and 5 subjects diagnosed with malignant mesothelioma. The results are compared with those obtained on artificially generated data whose general characteristics are similar to those of the real data.
TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER

1 INTRODUCTION

2 MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY
Mesothelioma Study, Demographics and Goals
Printed Glycan Array
General Information
Structure of Data for MATLAB
Data Preprocessing
Normalization
Quantile Normalization
Intra-Slide Normalization
Inter-Slide Normalization
Transformation
Transformation Necessity
Implementation and Result

3 FEATURE SELECTION AND CLASSIFICATION
Univariate Feature Selection
Multivariate Feature Selection
Forward Sequential Feature Selection (FWD)
Recursive Feature Elimination (RFE)
Genetic Algorithm (GA)
Ant Colony Optimization (ACO)
Classification and Regression Trees (C&RT/C4.5)
Random Forest Trees (RF)
Classifiers
Multiple Logistic Regression (MLR)
Generalized Linear Model (GLM)
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Naive Bayes/Mahalanobis Distance
K-Nearest Neighbor (KNN)
Classifier Performance Measures
Accuracy
Area Under the ROC Curve (AUC)
Cross Validation
Leave-One-Out Cross Validation (LOOCV)
K-fold Cross Validation
Hold-Out Cross Validation

4 ANT COLONY OPTIMIZATION ALGORITHM
Theory and the Algorithm
Implementation
Optimization Objective
M-Files
Step 1: Initialization
Step 2: Population
Step 3: Evaluation
Step 4: Deposition
Step 5: Preparation for New Iteration
Optional Step: Randomization
Empirical Tuning of ACO Parameters
Number of Ants
Stopping Criteria
Application of ACO to Artificial Data
Generation of Artificial Data
Contaminated Artificial Data
Results
Application of ACO to Mesothelioma Study
Experiment Design
Results

5 COMPARISON OF ACO WITH OTHER APPROACHES IN CLASSIFICATION
Genetic AUC Optimizer (GAUC)
Experiment Design
Efficiency Test
Stability Test
Cross-Validation Performance on Raw and Contaminated Mesothelioma Data
Results
Efficiency Test
Stability Test
Cross-Validation Performance on Raw and Contaminated Mesothelioma Data

6 CONCLUSION

REFERENCES
LIST OF TABLES

Table 3.1. WMW Rank Calculation Demonstration
Table 3.2. Result of Applying WMW to Mesothelioma Data Set
Table 4.1. Parameters to be Initialized for ACO
Table 4.2. Summary of Number of Ants Tuning (m = 4)
Table 4.3. Summary of Number of Ants Tuning Verification (m = 6)
Table 4.4. OCI-GID Reference Table
Table 4.5. Contamination Parameters for Artificial Data
Table 4.6. Selected Values of Parameters for Contamination
Table 4.7. The Performance of WMW on Artificial Data without/with Contamination
Table 4.8. Performance of WMW and ACO on Artificial Data
Table 4.9. Performance of WMW and ACO Applied on Artificial Data - Bootstrap
Table 4.10. Result of Comparing ACO Repeats and WMW on Mesothelioma Data Set
Table Result of Comparing ACO Repeats and WMW on Subsampled Mesothelioma Data Set
Table 5.1. Execution Time for Each Combination
Table 5.2. Average AUC Values for Each Combination
Table 5.3. Results of Stability Experiment
Table 5.4. Cross Validation on Mesothelioma Data Best AUC Values
Table 5.5. Cross Validation on Mesothelioma Data Features at Best AUC Values
Table 5.6. Cross Validation on Mesothelioma Data Best Stability
Table 5.7. Cross Validation on Mesothelioma Data Features at Best Stability
Table 5.8. Compare ACO and WMW on Raw Mesothelioma Data and Normalized Data
Table 5.9. Compare ACO and WMW on Contaminated Mesothelioma Datasets Repeated Training
Table 5.10. Compare ACO and WMW on Contaminated Mesothelioma Datasets Bootstrap
LIST OF FIGURES

Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB
Figure 2.2. Diagrammatic explanation of quantile normalization of training and test data
Figure 2.3. Raw data and transformed data with different lambda using Box-Cox transformation
Figure 3.1. Best features distribution plot for mesothelioma data set
Figure 3.2. Plotting GA Fitness (Best and Average Values)
Figure 3.3. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes
Figure 3.4. ROC curve space
Figure 4.1. The flow chart of ACO for feature selection
Figure 4.2. Flow chart of moving ants for ACO
Figure 4.3. Flow chart for solution evaluation and pheromone table updating
Figure 4.4. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations
Figure 4.5. Plot of ACO performance, using 25 ants to select 4 features in 1 iterations
Figure 4.6. Plot of ACO performance, using 5 ants to select 4 features in 1 iterations
Figure 4.7. Plot of ACO performance, using 1 ants to select 4 features in 1 iterations
Figure 4.8. Plot of ACO performance, using 2 ants to select 4 features in 1 iterations
Figure 4.9. Plot for ACO stopping criteria analysis demonstrating the trend of the ACO performance with iteration increasing
Figure 4.10. Distribution of the best AUC values in the 1 repeats of ACO function without/with stopping criteria
Figure Plots of patients distributions for the best features of artificial data without noise contamination
Figure Plots of patients distributions for the best features of artificial data contamination level: Tiny
Figure Plots of patients distributions for the best features of artificial data with medium contamination level
Figure Plots of patients distribution for the best features of artificial data with high contamination level
Figure Histogram of selected features obtained by repeated ACO applied to artificial data without contamination
Figure Histogram of AUC values obtained by repeated ACO applied to artificial data without contamination
Figure Histogram of selected features obtained by repeated ACO applied to artificial data with tiny contamination
Figure Histogram of AUC values obtained by repeated ACO applied to artificial data with tiny contamination
Figure Histogram of selected features obtained by repeated ACO applied to artificial data with medium contamination
Figure 4.20. Histogram of AUC values obtained by repeated ACO applied to artificial data with medium contamination
Figure Histogram of selected features obtained by repeated ACO applied to artificial data with heavy contamination
Figure Histogram of AUC values obtained by repeated ACO applied to artificial data with heavy contamination
Figure Repeated ACO applied to re-sampled artificial data without contamination
Figure Repeated ACO applied to re-sampled artificial data with tiny contamination
Figure Repeated ACO applied to re-sampled artificial data with medium contamination
Figure Repeated ACO applied to re-sampled artificial data with heavy contamination
Figure Repeated WMW applied to re-sampled original artificial data
Figure Repeated WMW applied to re-sampled artificial data with tiny contamination
Figure Repeated WMW applied to re-sampled artificial data with medium contamination
Figure 4.30. Repeated WMW applied to re-sampled artificial data with heavy contamination
Figure Histogram of repeated ACO on original mesothelioma data
Figure Histogram of AUC values obtained by repeated ACO applied to subsampled mesothelioma data
Figure Histogram of selected features obtained by repeated ACO applied to subsampled mesothelioma data
Figure Histogram of AUC values obtained by repeated WMW applied to subsampled mesothelioma data
Figure Histogram of selected features obtained by repeated WMW applied to subsampled mesothelioma data
Figure 5.1. The fitness progress of GAUC on mesothelioma data
Figure 5.2. Histogram for selected features in stability experiment ACO-GLM
Figure 5.3. Histogram for selected features in stability experiment ACO-SVM
Figure 5.4. Histogram for selected features in stability experiment ACO-GA
Figure 5.5. Histogram for selected features in stability experiment ACO-FLD
Figure 5.6. Histogram for selected features in stability experiment WMW-GLM
Figure 5.7. Histogram for selected features in stability experiment WMW-SVM
Figure 5.8. Histogram for selected features in stability experiment WMW-GA
Figure 5.9. Histogram for selected features in stability experiment WMW-FLD
Figure 5.10. Histogram for selected features in stability experiment GA-GLM
Figure 5.11. Histogram for selected features in stability experiment GA-SVM
Figure Histogram for selected features in stability experiment GA-GA
Figure Histogram for selected features in stability experiment GA-FLD
Figure Histogram for selected features in stability experiment FWD-GLM
Figure Histogram for selected features in stability experiment FWD-SVM
Figure Histogram for selected features in stability experiment FWD-GA
Figure Histogram for selected features in stability experiment FWD-FLD
Figure Cross validation results on contaminated mesothelioma dataset
ACKNOWLEDGEMENTS

Dr. Marko Vuskovic has been the ideal thesis supervisor. His sage advice, patient encouragement as well as cogent criticisms aided the writing of the thesis. I would also like to thank Dr. Joseph Lewis whose suggestions to this study were greatly needed.
CHAPTER 1

INTRODUCTION

With the development of computer capabilities and the deployment of advanced algorithms, biomarker discovery is becoming an important topic in bioinformatics and computational biology, including applications such as gene and SNP selection from high-dimensional data. The stability of such selection processes with respect to sampling variation, that is, their robustness, has received attention recently. Robustness of biomarkers is an important issue, as it may greatly influence subsequent biological validation.

Besides feature selection, classification also plays an important role in biomarker discovery. It is usually used to evaluate performance based on the result of feature selection. A number of methods can be used in this process, including logistic regression, the Fisher linear discriminant, support vector machines and many others. Recently, Ant Colony Optimization and Genetic Algorithms have also been introduced for classification.

The investigators at the Glycomic Laboratory of the NYU School of Medicine are evaluating a novel means of detecting mesothelioma and lung cancer early through what could ultimately be a simple blood test. They have developed a unique cancer diagnostic approach that utilizes a printed glycan array (PGA). This new high-throughput platform contains 286 carbohydrate molecules (glycans) that are often expressed on the surfaces of human cells, including abnormal sugars produced by lung cancer cells in response to changes induced by the cancer process. Researchers can measure antibodies against these abnormal glycans in the blood of people with mesothelioma or lung adenocarcinoma or those at risk for these diseases. This test could also be a tool for identifying new therapeutic targets.
The scientists are developing this array as a global way of looking at molecules that may serve as very early markers indicating that something is wrong inside lung or mesothelium cells. This information could be used to determine whether someone is at risk for mesothelioma or lung cancer, or whether someone who already has the disease is likely to do poorly and may need more aggressive therapy.
One of the basic problems in bioinformatics is that biomarker platforms generally deal with a large number of features, ranging from hundreds to thousands. Most of them are non-informative and ineffective in discriminating between patients in the control and case groups. Feature selection is therefore used to select the most relevant glycans, or to remove the noisy ones.

Several feature selection algorithms are available. They can be divided into two groups: univariate and multivariate feature selection algorithms. Univariate methods treat the candidate features individually. The discriminatory performance of each feature is evaluated separately, all features are then ranked by their performance, and the top features are used to train the classifier. In multivariate methods, features are treated as a group of dependent variables. Many algorithms for multivariate feature selection have been developed, such as Recursive Feature Accumulation (RFA), Recursive Feature Elimination (RFE) and sequential forward/backward feature selection. These algorithms were developed as a compromise, because global optimization becomes infeasible when the number of features is large. There are, however, heuristic algorithms for feature selection that approximate true global optimization, such as the Genetic Algorithm and Ant Colony Optimization. The latter is the focus of this thesis.

Ant Colony Optimization was initially proposed by Marco Dorigo in 1992 in his Ph.D. thesis. It is a probabilistic technique for solving computational problems that can be reduced to finding a good path through a graph. By moving over a graph built from the data model, ants communicate with each other and share information about the goal. In this research, ACO is used to find an optimal subset of the candidate features. The goal of this thesis is to apply the ideas of ACO to the diagnosis of cancer based on data obtained from PGA.
The study includes the implementation of ACO-based algorithms, the analysis of performance and tuning of algorithmic parameters, and a demonstration of the developed software on the diagnosis of mesothelioma and lung cancer. The implementation of ACO computes an important classification performance measure, the area under the Receiver Operating Characteristic curve (AUC), directly, as opposed to computing AUC after feature selection and projection. By comparing the results of applying ACO and other feature selection methods to PGA data, this study provides a better view of this new approach to cancer detection. Although ACO does not achieve the best performance among the methods compared, it performs well with noisy data, where other algorithms fail.

The material in this thesis is organized as follows. In Chapter 1, we introduce the general concepts of the technologies used in the research and the organization of this thesis. In Chapter 2, we introduce details of the PGA data for the mesothelioma study. In Chapter 3, we discuss other feature selection and classification methods, including univariate and multivariate feature selection algorithms, classification models and the methods for classifier evaluation. In Chapter 4, we discuss the implementation of ACO, including parameter tuning and the evaluation of ACO with both artificial data and real mesothelioma data. Chapter 5 describes the experiments designed to evaluate the performance of different feature selection methods combined with different classification algorithms. Chapter 6 presents the conclusions drawn from the experiments and discusses possible future work enabled by the research in this thesis.
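The two ideas introduced in this chapter, univariate ranking of features and direct computation of AUC, can be illustrated with a minimal sketch. It is written in Python rather than the MATLAB used in the thesis, and it is not the thesis implementation: the function names `auc` and `rank_features`, the toy data, and the choice to score each feature by max(AUC, 1 - AUC) so that the direction of the effect does not matter are all assumptions made for illustration. The pairwise-comparison form of AUC used here is equivalent to the normalized Wilcoxon-Mann-Whitney statistic.

```python
def auc(scores, labels):
    """AUC of one score vector against binary labels (1 = case, 0 = control).

    Computed directly as the fraction of (case, control) pairs in which the
    case has the higher score; tied pairs count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def rank_features(X, labels):
    """Univariate selection: score every column separately, best first.

    Assumption: a feature is scored by max(AUC, 1 - AUC), so a feature that
    is lower in cases is valued as much as one that is higher in cases.
    Returns a list of (column index, score) pairs.
    """
    scored = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        a = auc(column, labels)
        scored.append((max(a, 1.0 - a), j))
    # Sort by descending score; break ties by column index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(j, a) for a, j in scored]


# Hypothetical toy data: 4 patients (rows) by 3 features (columns).
X_toy = [
    [1.0, 5.0, 2.0],
    [2.0, 3.0, 2.5],
    [3.0, 4.0, 1.0],
    [4.0, 1.0, 3.0],
]
y_toy = [0, 0, 1, 1]
ranking = rank_features(X_toy, y_toy)
```

On this toy matrix the first column separates the two classes perfectly and is ranked first, followed by the second and third columns.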
CHAPTER 2

MESOTHELIOMA STUDY AND PRINTED GLYCAN ARRAY

2.1 MESOTHELIOMA STUDY, DEMOGRAPHICS AND GOALS

Mesothelioma, more precisely malignant mesothelioma (MM), is a rare form of cancer that develops in the mesothelium, the protective lining that covers many of the body's internal organs. It is usually caused by exposure to asbestos. Its most common site is the pleura (the outer lining of the lungs and internal chest wall), but it may also occur in the peritoneum (the lining of the abdominal cavity), the heart, the pericardium (a sac that surrounds the heart) or the tunica vaginalis.

Most people who develop mesothelioma have worked in jobs where they inhaled asbestos and glass particles, or they have been exposed to asbestos dust and fiber in other ways. Unlike lung cancer, there is no association between mesothelioma and smoking, but smoking greatly increases the risk of other asbestos-related cancers. Those who have been exposed to asbestos often utilize attorneys to collect damages for asbestos-related disease, including mesothelioma. Compensation via asbestos funds or lawsuits is an important issue in mesothelioma.

The symptoms of mesothelioma include shortness of breath due to pleural effusion (fluid between the lung and the chest wall) or chest wall pain, and general symptoms such as weight loss. The diagnosis may be suspected on the basis of a chest X-ray or CT scan, and is confirmed with a biopsy (tissue sample) and microscopic examination.

Diagnosing mesothelioma is often difficult, because the symptoms are similar to those of a number of other conditions. Diagnosis begins with a review of the patient's medical history. A history of exposure to asbestos may increase clinical suspicion of mesothelioma. A physical examination is performed, followed by chest X-rays and often lung function tests.
The life expectancy for mesothelioma patients is generally reported as less than one year following diagnosis. However, a patient's prognosis is affected by several factors, including how early the cancer is diagnosed and how aggressively it is treated. If a problem is suspected, a physician may request several diagnostic tests. These typically include medical imaging techniques such as X-rays, CT scans, PET scans and MRI scans. A combination of these tests is often used to determine the location, size and type of cancer. Biopsy procedures are often requested following an imaging scan to test samples of fluid and tissue for the presence of cancerous cells.

In this research we demonstrate early detection and/or diagnosis of malignant mesothelioma based on Printed Glycan Arrays (PGAs). The mesothelioma study includes 65 subjects exposed to asbestos but not diagnosed with MM, and 5 patients diagnosed with MM. The data were obtained from serum collected by Prof. Harvey Pass, MD, in the School of Medicine at NYU, and developed on PGAs at Cellexicon, Inc., La Jolla, CA. The data and related results were part of an NIH-NCI grant and have been published in several publications, including [6, 9]. In the following sections we describe the PGAs and their functionality, and the data preprocessing algorithms that are used before ACO-based feature selection and classification.

2.2 PRINTED GLYCAN ARRAY

In medicine, a biomarker can be a traceable substance that is introduced into an organism as a means to examine organ function or other aspects of health. It can also be a substance whose detection indicates a particular disease state. For example, the presence of an antibody may indicate an infection. More specifically, in this research a glycan biomarker indicates a change in expression or state of the immune system that correlates with the risk or progression of mesothelioma, or with the susceptibility of the disease to a given treatment.
Biochemical biomarkers are often used in clinical trials, where they are derived from bodily fluids that are easily available to the early phase researchers.
General Information

In the last five years, a new biomarker-discovery platform based on glycan arrays has emerged, which has some advantages over nucleic acid-based and other platforms. Printed glycan arrays are similar to DNA microarrays, but contain deposits of various carbohydrate structures (glycans) instead of spotted DNA. Most of these glycans can be found on the surfaces of normal human cells, human cancer cells, and many human infectious agents such as bacteria, viruses and other pathogenic microorganisms.

Transformation of cells from healthy to pre-malignant and malignant is associated with the appearance of abnormal glycosylation on proteins and lipids presented on the surface of these cells. The malignancy-related abnormal glycans are called tumor-associated carbohydrate antigens (TACA). There is growing evidence that numerous TACAs are immunogenic, and that the human immune system can generate antibodies against them. Since multiple glycans arrayed on PGAs are either known TACAs or closely related structures, the antibodies present in human sera that bind to glycans on PGAs can indicate the status of the immune system's response to human malignancies.

A printed glycan array (PGA) consists of a glass slide coated with a chemically reactive surface on which various glycans are covalently attached using standard amino-coupling chemistry and contact printing technology. A PGA slide contains several sub-arrays that are identical duplicates, each holding the entire currently available glycan library in the form of microscopic glycan deposits about 8 microns in size. For each slide, the data from each sub-array are treated as the raw data to which the processing and classification algorithms are applied.
The advantages of a potential PGA-based serum test for early detection of cancer and cancer risk can be summarized as follows: (a) minimal invasiveness of serum sampling; (b) minimal sampling variability, in contrast to the well-known heterogeneity of solid tissue samples; (c) stability of antibodies; (d) low cost of the technology; (e) low labor intensity and short duration of the test; (f) broad scope of the test, i.e. the test does not have to be narrowly targeted to a particular disease, e.g. a cancer type. All these advantages make the PGA platform attractive for early detection of disease and for potential application in screening of the general population.
In general, five steps are needed to obtain PGA data: printing of glycan arrays, development of arrays with serum samples, scanning, quantification and data aggregation. After these steps, a data structure is formed from the quantified PGA data. The details of the data structure are discussed in the next section. Because individual glycans on PGAs have only moderate discriminatory power (see Chapter 4), an ant-based feature selection algorithm is needed to find an optimal combination of several biomarkers for classification, modeling and other purposes. The results comparing the discriminatory power of individual glycans with that of combinations of glycans are discussed later in the thesis.

Structure of Data for MATLAB

We work with the data as a structure consisting mainly of a two-dimensional matrix, two row vectors and two column vectors, plus other auxiliary data (see Figure 2.1). One of the row vectors is called the Original Column Index (OCI). This vector holds the original index of each feature in the matrix after data quantification and before any features are extracted from the matrix. These indices correspond to the order of the glycans in the PGA library. The second row vector contains the Glycan Identification (GID) assigned to each feature. Each GID is a unique three-digit number that denotes a specific glycan structure used in the array. One of the column vectors is the Patient Identification (PID). Each patient is assigned a unique ID for further exploration. The second column vector (y) contains binary labels, i.e. membership in the control or case class for each patient.

[Figure 2.1 layout: OCI (1 x d), GID (1 x d), PID (n x 1), X (n x d), Y (n x 1)]
Figure 2.1. The data structure of mesothelioma PGAs data for MATLAB.
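The layout in Figure 2.1 can be sketched as a small container type. This is an illustrative Python sketch, not the MATLAB structure used in the thesis: the class name `PGAData`, the `select` helper, the use of strings for patient IDs and the sample values are hypothetical. The sketch only shows that OCI and GID must travel with the columns of X, while PID and y travel with the rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PGAData:
    oci: List[int]         # 1 x d, original column index of each feature
    gid: List[int]         # 1 x d, glycan identification of each feature
    pid: List[str]         # n x 1, patient identification (type assumed)
    X: List[List[float]]   # n x d, intensities: patients (rows) by glycans
    y: List[int]           # n x 1, binary labels: 0 = control, 1 = case

    def __post_init__(self):
        n, d = len(self.X), len(self.oci)
        if len(self.gid) != d or any(len(row) != d for row in self.X):
            raise ValueError("OCI, GID and every row of X must have length d")
        if len(self.pid) != n or len(self.y) != n:
            raise ValueError("PID and y must have one entry per row of X")

    def select(self, columns):
        """Keep only the given column positions, carrying OCI and GID along."""
        return PGAData(
            oci=[self.oci[j] for j in columns],
            gid=[self.gid[j] for j in columns],
            pid=list(self.pid),
            X=[[row[j] for j in columns] for row in self.X],
            y=list(self.y),
        )


# Hypothetical example with n = 2 patients and d = 3 glycans.
data = PGAData(
    oci=[1, 2, 3],
    gid=[101, 245, 310],
    pid=["P01", "P02"],
    X=[[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],
    y=[0, 1],
)
subset = data.select([2, 0])
```

Keeping the original column index alongside each selected feature means that a feature chosen by a selection algorithm can always be traced back to its position in the glycan library.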
Each element of the matrix X represents a fluorescent intensity in relative units (FRU) associated with the binding of anti-glycan antibodies from the serum of a patient (rows) to glycans (columns). The most critical part of the above data structure is the two-dimensional matrix X. Since the samples for different patients may have been collected by different physicians, with different equipment, at different locations and even in different years, the data might not be directly comparable. These differences can introduce biases and other unexpected effects into the data, which would reduce the reliability of the glycans (features) and even cause incorrect classification. Preprocessing of the raw data is therefore an important step, explained in the following section.

2.3 DATA PREPROCESSING

Real-world data are generally incomplete, noisy and inconsistent. They may lack attribute values, lack certain attributes of interest, or contain errors and outliers. Sometimes the data contain discrepancies caused by differences in equipment and environment. Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that can be processed more easily and effectively for the user's purpose. A number of different tools and methods are used for preprocessing. The most common are normalization and transformation.

In this study, the ultimate goal is to extract the most significant information from the PGAs in order to classify patients correctly. From this point of view, every step before classification, including PGA development, data normalization and transformation as well as feature selection, can be considered preprocessing.
Because the main objective of this study is to evaluate the performance of the ant-based algorithm in feature selection, we consider only normalization and transformation as preprocessing steps. In the following two sub-sections we discuss normalization, which brings the data for different patients to a common scale, and transformation, which changes the shape of the distribution of the raw data.
Normalization

In one usage in statistics, normalization is the process of isolating statistical error in repeated measured data. In another usage, normalization refers to the division of multiple sets of data by a common variable in order to negate that variable's effect on the data, thus allowing underlying characteristics of the data sets to be compared. This allows data on different scales to be compared by bringing them to a common scale. For example, in this study a PGA image is developed with a patient's serum and glycans on the glass slides, which are scanned by a laser scanner for quantification. The images for different patients can vary because of differences in equipment, location or even climate. To handle this problem, normalization is necessary to ensure that the data for different patients are comparable, so that the following steps, feature selection and classification, achieve reliable results. In this study we use three different normalization methods: quantile normalization, intra-slide normalization and inter-slide normalization. These three methods are applied to the raw data. The resulting processed data are used in the subsequent feature selection procedure.

QUANTILE NORMALIZATION

The goal of quantile normalization is to ensure that the distribution of intensities across all variables (glycans) is the same for each patient. The method is motivated by the idea that a quantile-quantile plot shows that two data vectors have the same distribution if the plot is a straight diagonal line, and different distributions if it is not. This concept is extended to n dimensions: if all n data vectors have the same distribution, then plotting the quantiles in n dimensions gives a straight line. This suggests that we can give a set of data the same distribution by projecting the points of the n-dimensional quantile plot onto the diagonal. The critical part of this normalization is to find the reference distribution.
Generally, a reference distribution is one of the standard statistical distributions, such as the Gaussian distribution or the Poisson distribution. The reference distribution can be generated randomly or by taking regular samples from the cumulative distribution function of the distribution. However, any reference distribution can be used.
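A minimal sketch of quantile normalization, in Python rather than the MATLAB used in the thesis, may help make the procedure concrete. It is not the thesis implementation. The function name `quantile_normalize` and the toy data are hypothetical, and the default reference distribution used here (the mean of the sorted rows) is an assumption chosen for illustration, since the text notes that any reference distribution can be used. Each patient (row) keeps the rank order of its intensities, but the values are replaced by the reference values of the same rank.

```python
def quantile_normalize(X, reference=None):
    """Give every row of X (one patient) the same distribution of values.

    reference: ascending list of d values to map onto. If None, the mean of
    the sorted rows is used (an assumption of this sketch).
    Returns (normalized matrix, reference actually used).
    """
    n, d = len(X), len(X[0])
    if reference is None:
        sorted_rows = [sorted(row) for row in X]
        reference = [sum(r[k] for r in sorted_rows) / n for k in range(d)]
    normalized = []
    for row in X:
        # Column positions ordered from the smallest to the largest value.
        # Tied values receive neighbouring reference values in column order.
        order = sorted(range(d), key=lambda j: row[j])
        new_row = [0.0] * d
        for rank, j in enumerate(order):
            new_row[j] = reference[rank]
        normalized.append(new_row)
    return normalized, reference


# Hypothetical toy data: 2 patients by 3 glycans.
X_train = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
X_norm, ref = quantile_normalize(X_train)

# The reference found on one data set can be reused on another data set,
# for example to normalize test data with the training reference.
X_test_norm, _ = quantile_normalize([[9.0, 7.0, 8.0]], reference=ref)
```

After normalization, sorting any row yields exactly the reference distribution, which is the property the quantile-quantile argument above requires.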