Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data



CMPE 59H Term Project Report
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
Fatma Güney, Kübra Kalkan
1/15/2013

Keywords: Non-linear Dimensionality Reduction, Principal Components Analysis, Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Support Vector Machines, Nearest Neighbor Classification, Leave-one-out Cross Validation, k-fold Cross Validation, Cancer Microarray Data.

Table of Contents
Introduction
Methodology
  Dimensionality Reduction Techniques
    Isomap
    Locally Linear Embedding
    Laplacian Eigenmaps
  Classification Techniques
    Nearest Neighbor Classification
    Support Vector Machines
Experiments and Results
  Datasets
  Experimental Setup
  Results
  Visualizations
Conclusion

Introduction

One particular property of microarray data is that the number of variables is much larger than the number of samples. Another is that the correlations between variables are complex and largely unknown, which makes the direct application of machine learning algorithms to the data difficult: there is always a high risk of singularity and overfitting. Researchers have developed various methods to overcome these problems on microarray data. They either select combinations of genes based on some strategy, which is called gene selection, or learn the underlying structure of the data and project it into a lower-dimensional and generally more discriminative space using dimensionality reduction techniques.

In this project, we compare the results of a set of dimensionality reduction techniques for the classification of gene expression microarray data. Classical linear dimensionality reduction techniques such as Principal Components Analysis (PCA) were shown to be successful in previous studies. In this study, we compare these techniques with a set of non-linear dimensionality reduction techniques: Isomap, Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). After dimensionality reduction, classification methods can be applied to the projected data in the low-dimensional space. We employ simple nearest neighbor classification and Support Vector Machines (SVMs) to classify the projected data. Prior to classification, we perform two types of cross validation to optimize the parameters of the dimensionality reduction techniques: k-fold cross validation in the case of nearest neighbor, and leave-one-out cross validation in the case of SVMs. We present results on six different cancer microarray datasets: AML Prognosis, Breast and Colon Cancer, Lymphoma (DLBCL vs. FL), Leukemia, Prostate Cancer, and Colon Cancer.

Methodology

Dimensionality Reduction Techniques

In a classification problem, the complexity of the algorithm and the number of samples necessary to train the classifier depend heavily on the number of variables. Beyond the decrease in complexity, reducing dimensionality has many other advantages. A smaller input dimensionality leads to a simpler model that is robust against variance in the data caused by noise, outliers, etc. As in the case of microarray data, a smaller dimensionality also enables visualization by projecting the data into a 2D or 3D space.

Dimensionality reduction can be performed as either feature selection or feature extraction. In feature selection, a number of relevant features are selected and used for classification or any other purpose. In feature extraction, the data is projected into a lower-dimensional subspace whose dimensions are combinations of the original ones. Feature extraction methods can be categorized as linear or non-linear, and supervised or unsupervised. Principal Components Analysis (PCA), Factor Analysis (FA), and Multidimensional Scaling (MDS) are among the best known linear, unsupervised techniques. Linear Discriminant Analysis is another linear method, but a supervised one, since it uses output information. In this project, we study non-linear, unsupervised dimensionality reduction techniques: Isomap (IM), Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). In this section, we explain these techniques and also MDS in detail, since IM is essentially a non-linear modification of MDS.

Multidimensional Scaling (MDS)

In MDS, $d_{rs}$ is defined as the distance between points $x_r$ and $x_s$, and the distances between each pair of points are given in advance. MDS projects the data into a lower-dimensional space while preserving these distances. For example, given road distances between cities, the result of MDS approximates a map containing these cities.
As with many dimensionality reduction techniques, MDS can be formulated as an optimization problem. MDS finds a mapping from $d$ dimensions to $k$ dimensions, and the aim of the mapping is to minimize the error, defined as the sum over all pairs of points of the discrepancy between their distances in the two spaces:

$$E(\theta \mid X) = \sum_{r,s} \left( \| z_r - z_s \| - \| x_r - x_s \| \right)^2$$

The mapping is defined by $z = g(x \mid \theta)$, where $g$ is the mapping function with parameter set $\theta$. It can be a linear transformation, as in $z = W^T x$, or a non-linear mapping; the non-linear case with the normalized error

$$E(\theta \mid X) = \sum_{r,s} \frac{\left( \| z_r - z_s \| - \| x_r - x_s \| \right)^2}{\| x_r - x_s \|^2}$$

is called Sammon mapping.

Isomap (IM)

Isomap uses the Euclidean distance for close neighboring points and estimates geodesic distances for faraway points. Geodesic distance is defined on a graph whose nodes correspond to data points and whose edges connect neighboring data points. The neighborhood can be defined by a threshold $\epsilon$ on the distance or by the $n$ nearest neighbors. The geodesic distance between any two nodes $r$ and $s$ is then defined as the length of the shortest path between them in the graph. Isomap finally applies MDS to these geodesic distances to compute the reduced-dimensional positions of all the points.

Locally Linear Embedding (LLE)

LLE recovers global non-linear structure from locally linear fits. The idea behind LLE is to represent each local patch of the manifold linearly, which is achieved by writing each point as a weighted linear sum of its neighbors. Given a point $x_r$ and its neighbors $x_s$, the first step is to find the reconstruction weights $W_{rs}$ that minimize the error function

$$E(W \mid X) = \sum_{r} \Big\| x_r - \sum_{s} W_{rs} \, x_s \Big\|^2$$

subject to $\sum_{s} W_{rs} = 1$. $W$ reflects the intrinsic geometric properties of the data that we want to preserve in the new space: nearby points in the original space should remain nearby. The second step keeps the weights fixed and finds the low-dimensional coordinates $z_r$ that minimize

$$E(Z \mid W) = \sum_{r} \Big\| z_r - \sum_{s} W_{rs} \, z_s \Big\|^2$$
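As a concrete illustration of the MDS step that Isomap reuses, here is a minimal sketch of classical (eigendecomposition-based) MDS in NumPy. The function name is our own, and we use the classical variant rather than iterative stress minimization, so this is an illustrative sketch rather than the exact procedure of any particular implementation.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions so that their pairwise Euclidean
    distances approximate the given n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]         # keep the k largest components
    scale = np.sqrt(np.maximum(vals[idx], 0.0))
    return vecs[:, idx] * scale              # n x k coordinates
```

Isomap would call a routine like this on the matrix of shortest-path (geodesic) distances of the neighborhood graph instead of raw Euclidean distances.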

Laplacian Eigenmaps (LEM)

Like LLE and IM, LEM builds an adjacency graph of neighboring nodes; again, the neighborhood can be defined by a distance threshold $\epsilon$ or by the $n$ nearest neighbors. The edge weights can be binary (1 if connected, 0 otherwise) or defined by the heat kernel:

$$W_{rs} = \exp\!\left( - \frac{\| x_r - x_s \|^2}{t} \right)$$

LEM minimizes an error function to ensure that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances:

$$E(Z \mid W) = \sum_{r,s} \| z_r - z_s \|^2 \, W_{rs}$$

The minimization is formulated as the construction of eigenmaps: the algorithm computes the eigenvalues and eigenvectors of the Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix constructed from the weight matrix $W$ as $D_{rr} = \sum_{s} W_{rs}$.

Classification Techniques

There are various techniques for classifying new data with the help of existing data. In this project, we use the nearest neighbor and support vector machine classification algorithms to classify the microarray data.

Nearest Neighbor Classification

k-nearest neighbor is a method for classifying objects according to the closest training data [1]. A test point is classified by finding its k nearest training points and taking a majority vote: the test point is assigned to the class of the majority of that group. This is explained in detail in [2]. Suppose that there are two classes, plus and minus, and that we already have some training data belonging to these classes. Our task is to classify a query point as plus or minus; this is depicted in Figure 4.

Figure 4. k-nearest Neighbor Classification

If k is equal to one, this is called 1-nearest neighbor or just nearest neighbor. In that case, the query point (shown in red in the figure) finds its closest point, a plus, and is classified as plus. If we increase k to 2, the point cannot be classified, since the second closest point is a minus and both classes get the same score. If k is 5, the query considers the region shown as a circle in the figure, which contains 2 plus and 3 minus points. As the majority belongs to the minus class, the red point is classified as minus.

Support Vector Machines

The Support Vector Machine classification technique is based on the idea of decision planes [3]. A decision plane separates sets of objects belonging to different classes. An example of a decision plane is depicted in Figure 5: there are two classes of objects, colored red and green, and the separating line is called the decision boundary. The objects to the right of this line belong to the green class, whereas the red objects lie on the left. Any new (white) object falling on the right side will be labeled as green, and on the left side as red.

Figure 5. Decision Plane and Linear Classifier

This is a classic linear classifier; however, a single line is not sufficient in many classification problems, and a more complex structure is needed to construct the optimal separation. This is shown in Figure 6: to classify the red and green objects correctly, we require a curve, which is not as simple as a line. Classifiers with this type of separating boundary are known as hyperplane classifiers, and Support Vector Machines are designed to handle such tasks.

Figure 6. Hyperplane Classifier

The basic idea of Support Vector Machines is depicted in Figure 7. The original objects are mapped using mathematical functions known as kernel functions; this mapping process is called transformation. In the mapped version, the problem is much easier, since a linear classifier suffices to separate the two classes: instead of constructing a complex curve, we only need to find an optimal line separating the green and red objects.

Figure 7. Transformation
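The k-nearest neighbor voting procedure described above, including the inconclusive k = 2 case, can be sketched in a few lines of Python. The function name and the tie-handling convention (returning None) are our own choices for illustration.

```python
from collections import Counter
import math

def knn_classify(train, query, k=1):
    """Majority vote among the k training points nearest to `query`.
    `train` is a list of (point, label) pairs; returns None on a tie."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None                      # tie, as in the k = 2 example above
    return votes[0][0]
```

On a toy version of the plus/minus example, k = 1 yields '+', k = 2 is an inconclusive tie, and k = 5 yields '-' by a 3-to-2 majority.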

Experiments and Results

Datasets

Dataset               Classes                    Samples  Dimensionality
AML Prognosis         Remission 28, Relapse 26      54       12625
Breast and Colon      Breast 31, Colon 21           52       22283
DLBCL                 DLBCL 58, FL 19               77        7070
Leukemia              ALL 47, AML 25                72        5147
Prostate              Normal 50, Prostate 52       102       12533
Colon Cancer (I2000)  Tumor 42, Normal 20           62        2000

AML Prognosis (GSE2191): The classification problem on this dataset is to distinguish patients with acute myeloid leukemia (AML) according to their prognosis after treatment (remission or relapse of the disease). Most of the patients enter complete remission after treatment, but a significant number of them relapse with resistant disease.

Breast and Colon (GSE3726): Predictive gene signatures can be measured with samples stored in RNAlater, which preserves RNA. In this dataset, there are a number of breast- or colon-specific genes that are predictive of relapse status. Frozen samples from the original dataset are used to distinguish between colon and breast cancer patients based on gene expression values.

DLBCL: Diffuse large B-cell lymphomas (DLBCL) and follicular lymphomas (FL) are two B-cell lineage malignancies with very different clinical presentations, natural histories, and responses to therapy. However, FLs frequently evolve over time and acquire the morphologic and clinical features of DLBCLs, and some subsets of DLBCLs have chromosomal translocations characteristic of FLs. The aim of the gene-expression-based classification is to distinguish between these two lymphomas.

Leukemia: This dataset contains gene-expression information for samples from human AML and acute lymphoblastic leukemia (ALL).

Prostate: This dataset contains gene expression measurements for samples of prostate tumors and adjacent prostate tissue not containing tumor.

Colon Cancer: This dataset contains the gene expressions with the highest minimal intensity across tumor and normal colon tissues.

Experimental Setup

There are two different parameters for IM, LLE, and LEM: the reduced dimensionality k and the number of neighboring nodes n. To optimize these parameters, we perform a grid search over the intervals suggested in the literature: k ranges from 2 to 15, and n from 4 to 16. As shown in the example table below, we obtain an accuracy for each combination, corresponding to a cell of the grid. After the table is filled, we find the cell containing the maximum accuracy and use its row and column values as the k and n values, respectively. If more than one cell contains the maximum, we take the minimum indices for performance reasons. In this particular example, the maximum value (75.92) is obtained when n equals 4 and k equals either 11 or 12; accordingly, we select 11 for k and 4 for n.

k\n    4     5     6     7     8     9    10    11    12    13    14    15    16
 2  42.59 55.55 53.70 59.25 50.00 61.11 57.40 55.55 59.25 62.96 57.40 62.96 61.11
 3  51.85 64.81 48.14 46.29 42.59 50.00 53.70 50.00 53.70 64.81 59.25 59.25 57.40
 4  55.55 62.96 53.70 51.85 55.55 53.70 53.70 62.96 61.11 59.25 57.40 57.40 53.70
 5  50.00 62.96 59.25 64.81 48.14 53.70 61.11 57.40 55.55 59.25 53.70 53.70 53.70
 6  62.96 59.25 62.96 62.96 57.40 55.55 61.11 57.40 61.11 55.55 55.55 72.22 70.37
 7  59.25 61.11 59.25 57.40 51.85 51.85 61.11 61.11 61.11 62.96 53.70 68.51 55.55
 8  55.55 57.40 59.25 50.00 46.29 55.55 59.25 59.25 61.11 61.11 61.11 68.51 53.70
 9  55.55 55.55 50.00 57.40 48.14 57.40 57.40 55.55 50.00 68.51 61.11 68.51 53.70
10  55.55 53.70 55.55 55.55 57.40 55.55 53.70 50.00 46.29 72.22 62.96 57.40 48.14
11  75.92 50.00 50.00 57.40 53.70 50.00 48.14 48.14 51.85 74.07 57.40 59.25 51.85
12  75.92 46.29 53.70 57.40 53.70 50.00 57.40 50.00 57.40 72.22 57.40 55.55 51.85
13  70.37 61.11 61.11 53.70 51.85 51.85 57.40 48.14 61.11 62.96 53.70 57.40 48.14
14  70.37 53.70 61.11 53.70 53.70 50.00 51.85 50.00 62.96 64.81 55.55 57.40 59.25
15  66.66 57.40 61.11 57.40 51.85 51.85 53.70 50.00 57.40 62.96 51.85 51.85 50.00
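The selection rule described above (maximum accuracy, ties broken by the smallest indices) is simple to state in code. This sketch, with an invented function name, assumes the grid is stored as a dict keyed by (k, n) pairs and breaks ties lexicographically.

```python
def select_parameters(acc):
    """Return the (k, n) pair with maximal accuracy in the grid `acc`,
    breaking ties by the smallest (k, n) in lexicographic order."""
    best = max(acc.values())
    return min(kn for kn, a in acc.items() if a == best)
```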

Results

In our experiments, we use two different experimental setups to obtain the accuracies. In this section, we explain the details of these setups and present our results for each dataset.

The first setup is k-fold cross validation. For each cell of the grid above, we divide the data k times into train and test sets such that the number of test samples is always 10. For example, there are 54 samples in the AML dataset; for each parameter combination, we permute the samples and use 44 samples as train data and 10 as test data, giving 5 folds, i.e., 5-fold cross validation. Accuracy is reported as the average of the accuracies obtained at each fold. For this setup, we obtain results for PCA, IM, LLE, and LEM, and use nearest neighbor as the classifier. Results are shown in the table below. According to this table, the optimal values of the reduced dimensionality and the neighborhood vary considerably across datasets. However, the optimized values of these parameters seem close for different techniques applied to the same dataset. For example, in the case of AML Prognosis, the reduced dimensionality is either 3 or 2, except for IM. From this, we can infer that the data can be projected into a very small dimensionality compared to its original one (12625) while still achieving accuracies around 60-70%. IM produces the best result on this dataset, with k equal to 9.

Dataset        Method  Dimension  Neighborhood  Accuracy (%)
AML Prognosis  PCA         3          -            61.11
               IM          9          5            77.45
               LLE         2          8            75.56
               LEM         3          4            64.81
Breast&Colon   PCA        14          -            90.38
               IM          3          5            96.67
               LLE        12          5           100.0
               LEM        11          4            90.38
DLBCL          PCA         6          -            85.71
               IM          5         15            91.67
               LLE         8         10            97.14
               LEM         6         10            87.01
Leukemia       PCA        11          -            91.67
               IM          7          7            96.77
               LLE         5          8            93.65
               LEM        15          5            94.44
Prostate       PCA        11          -            76.47
               IM          3         12            80.0
               LLE        15         14            83.0
               LEM         7         10            79.41
Colon          PCA         9          -            80.65
               IM         11          4            90.0
               LLE         4          4            93.33
               LEM         2          8            90.32

The non-linear methods beat PCA on almost all datasets. The accuracies obtained with PCA are approximately 10% lower than those of the other methods on some of the less separable datasets, including AML Prognosis and Colon. Other datasets, such as Breast & Colon and Leukemia, generally yield high accuracies for all methods; even on these, the three non-linear methods always reach higher accuracies than PCA. This shows the importance of modeling non-linearity in microarray datasets. Comparing the non-linear techniques with each other, IM and LLE are clearly superior to LEM: for every dataset, the highest accuracy is achieved by either LLE or IM. Consistent with the literature, LEM produces the most variable results.

The second setup is leave-one-out cross validation. This time, we divide the data into train and test sets as many times as there are samples, such that each sample is used once as the test data while the remaining ones serve as training data. Accuracy is reported as the average of the accuracies obtained at each run. For this setup, we obtain results for IM, LLE, and LEM, and use an SVM with a linear kernel as the classifier. Results are shown in the table below; the highest result for each dataset is obtained with LLE. Following LLE, IM and LEM produce similar results. Interestingly, LEM reaches the same high results as IM and LLE when an SVM is used for classification instead of nearest neighbor. Even though it is not completely fair to compare these results with the previous table, since they use different experimental setups, we observe that the results of the second setup are higher for every dataset except Breast & Colon and Colon: the results are worse than in the first setup on Breast & Colon, and almost the same accuracies are obtained on Colon.
Dataset        Method  Dimensionality  Neighborhood  Accuracy (%)
AML Prognosis  IM           8               9            72.34
               LLE         11              10            78.72
               LEM         11               4            75.93
Breast&Colon   IM          10               5            88.57
               LLE          2               4            93.33
               LEM         10               6            88.70
DLBCL          IM          11              13            91.55
               LLE         13              11           100.0
               LEM         11               8            96.10
Leukemia       IM           2              16            98.51
               LLE          8              15           100.0
               LEM         10               6            98.61
Prostate       IM           9               7            88.30
               LLE         15              12            90.0
               LEM          9              12            85.29
Colon          IM          12               5            91.43
               LLE          2               4            93.33
               LEM         10               6            91.94
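The leave-one-out procedure used in this second setup can be sketched as follows. Here `classify` stands for any classifier function (in the report, an SVM with a linear kernel applied after dimensionality reduction); the function names are our own for illustration.

```python
def leave_one_out(n):
    """Yield (train_indices, test_index) pairs: each of the n samples
    serves as the test set exactly once, the rest as training data."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def loocv_accuracy(samples, labels, classify):
    """Average accuracy over all leave-one-out splits, where
    classify(train_samples, train_labels, test_sample) -> predicted label."""
    correct = 0
    for train, test in leave_one_out(len(samples)):
        pred = classify([samples[j] for j in train],
                        [labels[j] for j in train],
                        samples[test])
        correct += (pred == labels[test])
    return correct / len(samples)
```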

Visualizations

In this section, we project the different datasets using each dimensionality reduction technique once, under the second experimental setup. In this setup, IM achieves its highest accuracy on the Leukemia dataset when the reduced dimensionality is 2; as can be seen from the corresponding figure, the data is linearly separable in the 2D space. Similarly, the Breast & Colon dataset is linearly separable, and LLE achieves its highest accuracy on this dataset when the reduced dimensionality is 2. The results of LEM on the Colon dataset and of PCA on the Prostate dataset are also shown. In these cases, the data points are not clearly separable, confirming the results of the previous section; the result of PCA in particular is very cluttered. PCA is only good at representing data lying on a linear subspace, since it is a linear method. These visualizations show that the microarray data we use has a structure more complex than linear, which cannot be captured by PCA when projecting the data into a very low-dimensional space. The non-linear methods, especially LLE and IM, preserve more information about the data, such as the locality that reflects neighborhood relationships. These local relationships constitute the intrinsic geometric properties of the data that non-linear methods are designed to recover in a lower dimensionality.

[Figures: IM on Leukemia, LLE on Breast and Colon, LEM on Colon, PCA on Prostate]

Conclusion

In this project, we compared four dimensionality reduction techniques, one linear and three non-linear, using two different classifiers. We optimized the parameters of the dimensionality reduction techniques by cross validation and presented our results under two different setups for six different datasets. We observed significantly lower accuracies with PCA compared to the non-linear methods. LLE and IM showed the best performance across setups and datasets, consistent with the literature. These two methods preserve the underlying structure of the data better than the other methods when the data is projected into a space of very small dimensionality compared to the original.

References

1) K-Nearest Neighbor Algorithm. Retrieved January 14, 2013, from http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
2) K-Nearest Neighbors. Retrieved January 14, 2013, from http://www.statsoft.com/textbook/k-nearest-neighbors/
3) Support Vector Machines. Retrieved January 14, 2013, from http://www.statsoft.com/textbook/support-vector-machines/
4) Ethem Alpaydın. Introduction to Machine Learning, second edition. The MIT Press.