Comparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data


CMPE 59H Term Project Report

Comparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Fatma Güney, Kübra Kalkan
1/15/2013

Keywords: Nonlinear Dimensionality Reduction, Principal Components Analysis, Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Support Vector Machines, Nearest Neighbor Classification, Leave-one-out Cross Validation, k-fold Cross Validation, Cancer Microarray Data.
Table of Contents

- Introduction
- Methodology
  - Dimensionality Reduction Techniques: Isomap, Locally Linear Embedding, Laplacian Eigenmaps
  - Classification Techniques: Nearest Neighbor Classification, Support Vector Machines
- Experiments and Results
  - Datasets
  - Experimental Setup
  - Results
  - Visualizations
- Conclusion
Introduction

One particular property of microarray data is that the number of variables is much larger than the number of samples. Another is that the correlations between variables are complex and remain unknown, which makes direct application of machine learning algorithms to the data harder: there is always a high risk of singularity and overfitting. Researchers have developed various methods to overcome these problems on microarray data. They either select combinations of genes based on some strategy, which is called gene selection, or learn the underlying structure of the data and project it into a lower-dimensional and generally more discriminative space by using dimensionality reduction techniques.

In this project, we compare the results of a set of dimensionality reduction techniques for the classification of gene expression microarray data. Classical dimensionality reduction techniques like Principal Components Analysis (PCA) were proven successful in previous studies. In this study, we compare the results of these techniques with a set of nonlinear dimensionality reduction techniques including Isomap, Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). After dimensionality reduction, classification methods can be applied to the projected data in the low-dimensional space. We employ simple nearest neighbor classification and Support Vector Machines (SVMs) for classification of the projected data. Prior to classification, we perform two types of cross validation to optimize the parameters of the dimensionality reduction techniques: k-fold cross validation in the case of nearest neighbor, and leave-one-out cross validation in the case of SVMs. We present results on six different cancer microarray datasets: AML Prognosis, Breast and Colon Cancer, Lymphoma (DLBCL vs. FL), Leukemia, Prostate Cancer, and Colon Cancer.
Methodology

Dimensionality Reduction Techniques

In a classification problem, the complexity of the algorithm and the number of samples necessary to train the classifier are highly dependent on the number of variables. In addition to the decrease in complexity, there are many other advantages of reducing dimensionality. A smaller input dimensionality leads to a simpler model which is robust against variance in the data caused by noise, outliers, etc. As in the case of microarray data, smaller dimensionality also enables visualization by projecting the data into a lower space, usually 2D or 3D.

Dimensionality reduction can be performed as either feature selection or feature extraction. In feature selection, a number of relevant features are selected and used for classification, or for any other purpose. In feature extraction, the data is projected into a lower subspace whose dimensions are combinations of the original dimensions. Feature extraction methods can be categorized as linear or nonlinear, and supervised or unsupervised. Principal Components Analysis (PCA), Factor Analysis (FA), and Multidimensional Scaling (MDS) are among the best known linear, unsupervised techniques. Linear Discriminant Analysis (LDA) is another linear method, but a supervised one, since it uses output information. In this project, we study nonlinear, unsupervised dimensionality reduction techniques: Isomap (IM), Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). In this section, we explain these techniques, and also MDS in detail, since IM is basically a nonlinear modification of MDS.

Multidimensional Scaling (MDS)

In MDS, $d_{rs}$ is defined as the distance between points $x^{(r)}$ and $x^{(s)}$, and the distances between each pair of points are given in advance. MDS projects the data into a lower dimensional space while preserving these distances. For example, given road distances between cities, the result of MDS approximates a map containing these cities.
As in many dimensionality reduction techniques, MDS can be formulated as an optimization problem. MDS finds a mapping from the original $d$ dimensions to $k$ dimensions, and the aim of the mapping is to minimize the error defined over the distances between each pair of points in the two spaces:

$$E(\theta \mid X) = \sum_{r,s} \left( \lVert z^{(r)} - z^{(s)} \rVert - \lVert x^{(r)} - x^{(s)} \rVert \right)^2$$

The mapping is defined by $z = g(x \mid \theta)$, where $g$ is the mapping function with parameter set $\theta$. It can be a linear transformation, as in PCA, or a nonlinear mapping, in which case the method is called Sammon mapping.

Isomap (IM)

Isomap uses the Euclidean distance for close neighboring points and estimates geodesic distances for faraway points. The geodesic distance is defined on a graph whose nodes correspond to data points and whose edges connect neighboring data points. The neighborhood can be defined by a threshold on the distance, or as the $n$ nearest neighbors. The geodesic distance between any two nodes $x^{(r)}$ and $x^{(s)}$ is then defined as the length of the shortest path between them in the graph. Isomap finally uses MDS on these geodesic distances to compute the reduced-dimensional positions of all the points.

Locally Linear Embedding (LLE)

LLE recovers global nonlinear structure from locally linear fits. The idea behind LLE is to represent each local patch of the manifold linearly, which is achieved by writing each point as a linear weighted sum of its neighbors. Given a point $x^{(r)}$ and its neighbors $x^{(s)}$, the aim is to find reconstruction weights $W_{rs}$ that minimize the error function

$$E(W \mid X) = \sum_{r} \Big\lVert x^{(r)} - \sum_{s} W_{rs}\, x^{(s)} \Big\rVert^2$$

$W$ reflects the intrinsic geometric properties of the data that we want to preserve in the new space: nearby points in the original space should remain nearby. In the second step, the low-dimensional coordinates $z^{(r)}$ are found by minimizing the same reconstruction error with the weights held fixed:

$$E(Z \mid W) = \sum_{r} \Big\lVert z^{(r)} - \sum_{s} W_{rs}\, z^{(s)} \Big\rVert^2$$
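As an illustration of the MDS and Isomap steps described above, the following self-contained NumPy sketch implements classical (Torgerson) MDS and a tiny Isomap built on top of it. This is a simplified stand-in, not the project's implementation: the function names, the $n$-nearest-neighbor graph construction, and the dense Floyd-Warshall shortest-path step are our own choices for illustration.

```python
# Minimal sketch: classical MDS, then Isomap = MDS on geodesic distances.
import numpy as np

def classical_mds(D, k=2):
    """Coordinates in k dimensions whose distances approximate the matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]                # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

def isomap(X, n_neighbors, k=2):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((n, n), np.inf)                  # graph distances; inf = no edge
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = G[nbrs, i] = D[i, nbrs]     # connect to nearest neighbors
    for m in range(n):                           # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return classical_mds(G, k)                   # MDS on geodesic distances

# Usage: 4 points along a line keep their spacing in a 1-D embedding.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = isomap(X, n_neighbors=1, k=1)
print(np.allclose(np.abs(Y - Y.T), np.abs(X - X.T)))  # True
```

On data sampled from a curved manifold, the shortest-path matrix G differs from the raw Euclidean distances, which is exactly what lets Isomap unfold nonlinear structure before the MDS step.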
Laplacian Eigenmaps (LEM)

As in LLE and IM, LEM builds an adjacency graph of neighboring nodes. Again, the neighborhood can be defined based on a distance threshold or as the $n$ nearest neighbors. The weights of the edges can be binary (1 if connected, 0 otherwise), or they can be defined by the heat kernel:

$$W_{rs} = e^{- \lVert x^{(r)} - x^{(s)} \rVert^2 / t}$$

LEM minimizes an error function to ensure that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances:

$$E(Z \mid W) = \sum_{r,s} W_{rs}\, \lVert z^{(r)} - z^{(s)} \rVert^2$$

The minimization is formulated as the construction of eigenmaps: the algorithm computes the eigenvalues and eigenvectors of the Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix constructed from the weight matrix $W$ as $D_{rr} = \sum_s W_{rs}$.

Classification Techniques

There are various techniques for classifying new data with the help of existing data. In this project we utilize the nearest neighbor and support vector machine classification algorithms in order to classify the microarray data.

Nearest Neighbor Classification

k-nearest neighbor is a method for classifying objects according to the closest training data [1]. A test point is classified by finding its k nearest training points and assigning it to the majority class among them. This is explained in detail in [2]. Suppose that there are two classes, plus and minus, and we already have some training data belonging to these classes. Our task is to classify a query point as plus or minus. This is depicted in Figure 4.

Figure 4. k-nearest neighbor classification

If k is equal to one, this is called 1-nearest neighbor, or just nearest neighbor. In that case, the query point, shown in red in the figure, finds its closest point, which is a plus, and is classified as plus. If we increase k to 2, the point can no longer be classified, since the second closest point is a minus and both classes get the same score. If k is 5, the neighborhood is the circled region in the figure, which contains 2 plus and 3 minus points. As the majority belongs to the minus class, the red point is classified as minus.

Support Vector Machines

The Support Vector Machine classification technique is based on the idea of decision planes [3]. A decision plane separates sets of objects belonging to different classes. An example is depicted in Figure 5, with two classes of objects colored red and green. The separating line is called the decision boundary: objects to the right of this line belong to the green class, whereas the red objects lie on the left. Any new (white) object will be labeled green if it falls on the right side, and red if it falls on the left.

Figure 5. Decision plane and linear classifier

This is a classic linear classifier; however, a straight line is not sufficient in most classification problems. A more complex structure is needed in order to construct the optimal separation, as shown in Figure 6: to classify the red and green objects correctly, we would require a curve rather than a simple line. Classifiers with this type of separating boundary are known as hyperplane classifiers, and Support Vector Machines were designed to deal with such tasks.

Figure 6. Hyperplane classifier

The basic idea of Support Vector Machines is depicted in Figure 7. The original objects are mapped, with the help of mathematical functions known as kernel functions, into a new space; this mapping process is called a transformation. In the mapped version, the problem becomes much easier, since a linear classifier is enough to separate the two classes: instead of constructing a complex curve, we only need to find an optimal line separating the green and red objects.

Figure 7. Transformation
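The plus/minus nearest-neighbor example above can be sketched in a few lines of Python. The coordinates below are invented for illustration and are not taken from Figure 4; the function name is our own.

```python
# k-nearest-neighbor classification by majority vote among the k closest points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]             # majority label

# One '+' point very close to the query, several '-' points a bit farther away:
X = np.array([[0.5, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 2.0]])
y = np.array(['+', '-', '-', '-'])
q = np.array([0.0, 0.0])
print(knn_predict(X, y, q, k=1))  # '+'  (single nearest neighbor wins)
print(knn_predict(X, y, q, k=3))  # '-'  (majority of the 3 nearest)
```

As in the figure, the prediction can flip as k grows: the single nearest neighbor is a plus, but the majority of a larger neighborhood is minus.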
Experiments and Results

Datasets

Dataset                 Classes (samples)
AML Prognosis           Remission (28), Relapse (26)
Breast and Colon        Breast (31), Colon (21)
DLBCL                   DLBCL (58), FL (19)
Leukemia                ALL (47), AML (25)
Prostate                Normal (50), Prostate (52)
Colon Cancer (I2000)    Tumor (42), Normal

AML Prognosis (GSE2191): The classification problem on this dataset is to distinguish patients with acute myeloid leukemia (AML) according to their prognosis after treatment (remission or relapse of the disease). Most patients enter complete remission after treatment, but a significant number of them experience relapse with resistant disease.

Breast and Colon (GSE3726): Predictive gene signatures can be measured with samples stored in RNAlater, which preserves RNA. In this dataset, there are a number of breast- or colon-specific genes that are predictive of relapse status. Frozen samples from the original dataset are used to distinguish between colon and breast cancer patients based on gene expression values.

DLBCL: Diffuse large B-cell lymphomas (DLBCL) and follicular lymphomas (FL) are two B-cell lineage malignancies that have very different clinical presentations, natural histories and responses to therapy. However, FLs frequently evolve over time and acquire the morphologic and clinical features of DLBCLs, and some subsets of DLBCLs have chromosomal translocations characteristic of FLs. The aim of the gene-expression-based classification is to distinguish between these two lymphomas.

Leukemia: This dataset contains gene-expression information for samples from human AML and acute lymphoblastic leukemia (ALL).

Prostate: This dataset contains gene expression measurements for samples of prostate tumors and adjacent prostate tissue not containing tumor.

Colon Cancer: This dataset contains the gene expressions with highest minimal intensity across tumor and normal colon tissues.
Experimental Setup

There are two different parameters for IM, LLE, and LEM: the reduced dimensionality k and the number of neighboring nodes n. In order to optimize these parameters, we perform a grid search over intervals suggested in the literature: for k, we search from 2 to 15, and for n, from 4 to 16. As shown in the example table below, we obtain an accuracy for each combination, corresponding to a cell of the grid. After the table is filled, we find the cell that contains the maximum accuracy and use its row and column values as the k and n values, respectively. In that particular example, the maximum value is obtained when n equals 4 and k equals either 10 or 11. If there is more than one cell containing the maximum, we take the minimum indices for performance reasons; according to this table, we therefore select 10 for k and 4 for n.

[Example accuracy grid: rows k = 2 to 15, columns n = 4 to 16]
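The grid search and its tie-breaking rule can be sketched as follows. The `score` callback here is a hypothetical stand-in for the cross-validated accuracy of a reduction-plus-classification pipeline; only the search ranges (k from 2 to 15, n from 4 to 16) and the minimum-index tie-breaking come from the text above.

```python
# Grid search over reduced dimensionality k and neighborhood size n.
def grid_search(score, k_range=range(2, 16), n_range=range(4, 17)):
    """Return (best_k, best_n); ties are resolved toward the smallest indices."""
    best = (None, None, -1.0)
    for k in k_range:                  # rows of the grid
        for n in n_range:              # columns of the grid
            acc = score(k, n)
            if acc > best[2]:          # strict '>' keeps the first (minimal) maximum
                best = (k, n, acc)
    return best[:2]

# Usage with a toy score that peaks at k = 10 or 11 when n = 4:
toy = lambda k, n: 1.0 if (k in (10, 11) and n == 4) else 0.5
print(grid_search(toy))  # (10, 4)
```

Because the loops visit cells in increasing order of k and n, the first cell reaching the maximum wins, which reproduces the "take the minimum indices" rule.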
Results

In our experiments, we use two different experimental setups to obtain the accuracies. In this section, we explain the details of these setups and present our results for each dataset.

The first setup is k-fold cross validation. For each cell of the grid above, we divide the data k times into train and test sets, such that the number of test samples is always 10. For example, there are 54 samples in the AML dataset; for each parameter combination, we permute the samples and use 44 samples as training data and 10 as test data, so there are 5 folds, giving 5-fold cross validation. The accuracy is reported as the average of the accuracies obtained on each fold. For this setup, we obtain results for PCA, IM, LLE, and LEM, and use nearest neighbor as the classifier. Results are shown in the table below, with the highest result for each dataset highlighted in red.

According to this table, the optimal values of reduced dimensionality and neighborhood vary considerably across datasets. However, the optimized values of these parameters are close for different techniques applied to the same dataset. For example, in the case of AML Prognosis, the reduced dimensionality is either 3 or 2, except for IM. From this, we can infer that this data can be projected into a very small dimensionality compared to its original dimensionality (12625), and accuracies around 60-70% can be achieved in that space. On the other hand, IM produces the best results on this dataset, where k equals 9.

[Table: dimension, neighborhood, and accuracy (%) of PCA, IM, LLE, and LEM on the AML Prognosis, Breast & Colon, DLBCL, Leukemia, Prostate, and Colon datasets]

Nonlinear methods beat PCA on almost all datasets. The accuracies obtained with PCA are approximately 10% lower than those of the other methods on some of the less separable datasets, including AML Prognosis and Colon. Some other datasets, like Breast & Colon and Leukemia, generally yield high accuracies for all of the methods used; even on these datasets, the three nonlinear methods always reach higher accuracy values than PCA. This shows the importance of modeling nonlinearity in microarray datasets. As for the comparison of the nonlinear techniques with each other, IM and LLE have a clear superiority over LEM: for all of the datasets, the highest accuracies are achieved by either LLE or IM. In line with the literature, LEM produces the most varying results.

The second setup is leave-one-out cross validation. This time, we divide the data into train and test sets as many times as there are samples, in such a way that each time a different sample is used as the test data and the remaining ones are used as training data. The accuracy is reported as the average of the accuracies obtained at each run. For this setup, we obtain results for IM, LLE, and LEM, and use an SVM with a linear kernel as the classifier. Results are shown in the table below; the highest results, highlighted in red, are obtained with LLE for each dataset. Following LLE, IM and LEM produce similar results. One interesting point is that LEM is able to reach the same high results as IM and LLE when an SVM is used for classification instead of nearest neighbor. Even though it is not completely fair to compare these results with the previous table, since they use different experimental setups, we observe that the results of the second setup are higher for each dataset except the Breast & Colon and Colon datasets: results are worse than in the first setup for Breast & Colon, and almost the same accuracies are obtained for Colon.
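The leave-one-out procedure of the second setup can be sketched as below. The `predict_one` callback is a hypothetical stand-in for the dimensionality-reduction-plus-SVM pipeline; a 1-nearest-neighbor predictor is used here only to keep the example self-contained.

```python
# Leave-one-out cross validation: each sample is the test set exactly once.
import numpy as np

def loo_accuracy(X, y, predict_one):
    """predict_one(X_train, y_train, x_test) -> predicted label."""
    n = len(y)
    correct = 0
    for i in range(n):                        # hold out sample i
        mask = np.arange(n) != i
        pred = predict_one(X[mask], y[mask], X[i])
        correct += (pred == y[i])
    return correct / n                        # average accuracy over all runs

# Usage with a 1-nearest-neighbor predictor on a tiny separable set:
def nn_predict(Xtr, ytr, x):
    return ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))]

X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0, 0, 1, 1])
print(loo_accuracy(X, y, nn_predict))  # 1.0
```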
[Table: dimensionality, neighborhood, and accuracy (%) of IM, LLE, and LEM on the AML Prognosis, Breast & Colon, DLBCL, Leukemia, Prostate, and Colon datasets under the leave-one-out setup]
Visualizations

In this section, we project the different datasets by using each dimensionality reduction technique once, using the second experimental setup. In this setup, IM achieves its highest accuracy on the Leukemia dataset when the reduced dimensionality is 2. As can be seen from the figure below, the data is linearly separable in the 2D space. Similarly, the Breast & Colon dataset is linearly separable, and LLE achieves its highest accuracy on this dataset when the reduced dimensionality is 2. The results of LEM on the Colon dataset and of PCA on the Prostate dataset are also shown. In these cases, the data points are not clearly separable, confirming the results of the previous section. The result of PCA, in particular, is very messy. PCA is only good at representing data lying on a linear subspace, since it is a linear method. These visualizations show that the microarray data we use has a structure more complex than linear, which cannot be captured by PCA when projecting the data into a very low-dimensional space. The nonlinear methods, especially LLE and IM, preserve more information about the data, such as the locality that encodes neighborhood relationships. These local relationships constitute the intrinsic geometric properties of the data that nonlinear methods are designed to recover in a lower dimensionality.

Figures: IM on Leukemia; LLE on Breast and Colon; LEM on Colon; PCA on Prostate.
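A minimal sketch of how such 2D scatter-plot coordinates can be produced, assuming plain PCA via SVD as the projector (this is our illustrative stand-in; any of the nonlinear techniques above could be substituted, and the resulting points plotted with a library such as matplotlib):

```python
# Project high-dimensional samples onto their top-2 principal components.
import numpy as np

def project_2d(X):
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates on the top-2 PCs

# Usage: 6 samples with 100 "genes" each become 6 points in the plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))
Y = project_2d(X)
print(Y.shape)  # (6, 2)
```

Each class can then be drawn in its own color, which is how separability (or its absence) becomes visible in the figures.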
Conclusion

In this project, we compared four different dimensionality reduction techniques, one linear and three nonlinear, using two different classifiers. We optimized the parameters of the dimensionality reduction techniques by cross validation and presented our results with two different setups for six different datasets. We observed a significant decrease in accuracy with PCA compared to the nonlinear methods. LLE and IM showed the best performances across setups and datasets, consistent with the literature. These two methods preserve the underlying structure of the data better than the other methods when the data is projected into a new space with a very small dimensionality compared to the original.
References

1) K-Nearest Neighbor Algorithm. Retrieved January 14, 2013.
2) K-Nearest Neighbors. Retrieved January 14, 2013.
3) Support Vector Machines. Retrieved January 14, 2013.
4) Ethem Alpaydın. Introduction to Machine Learning, second edition. The MIT Press.
T61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
More informationAn unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis
An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis Roberto Avogadri 1, Giorgio Valentini 1 1 DSI, Dipartimento di Scienze dell Informazione, Università degli Studi di Milano,Via
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More information203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
More informationMusic Classification by Composer
Music Classification by Composer Janice Lan janlan@stanford.edu CS 229, Andrew Ng December 14, 2012 Armon Saied armons@stanford.edu Abstract Music classification by a computer has been an interesting subject
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationDecision Support System Methodology Using a Visual Approach for Cluster Analysis Problems
Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of BarIlan University RamatGan,
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationSelection of the Suitable Parameter Value for ISOMAP
1034 JOURNAL OF SOFTWARE, VOL. 6, NO. 6, JUNE 2011 Selection of the Suitable Parameter Value for ISOMAP Li Jing and Chao Shao School of Computer and Information Engineering, Henan University of Economics
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationPCA to Eigenfaces. CS 510 Lecture #16 March 23 th A 9 dimensional PCA example
PCA to Eigenfaces CS 510 Lecture #16 March 23 th 2015 A 9 dimensional PCA example is dark around the edges and bright in the middle. is light with dark vertical bars. is light with dark horizontal bars.
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationCS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d dimensional subspace Axes of this subspace
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationIntroduction to Pattern Recognition
Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
More informationPixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006
Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationHDDVis: An Interactive Tool for High Dimensional Data Visualization
HDDVis: An Interactive Tool for High Dimensional Data Visualization Mingyue Tan Department of Computer Science University of British Columbia mtan@cs.ubc.ca ABSTRACT Current high dimensional data visualization
More informationCluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototypebased Fuzzy cmeans
More informationFace Recognition using SIFT Features
Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis DensityBased Cluster Analysis Cluster Evaluation Constrained
More informationKnowledge Discovery and Data Mining. Structured vs. NonStructured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. NonStructured Data Most business databases contain structured data consisting of welldefined fields with numeric or alphanumeric values.
More informationConcepts in Machine Learning, Unsupervised Learning & Astronomy Applications
Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications ChingWa Yip cwyip@pha.jhu.edu; Bloomberg 518 Human are Great Pattern Recognizers
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationSupporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance Knearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationAn Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,
More informationADVANCED MACHINE LEARNING. Introduction
1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures
More informationHIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM
HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM Ms.Barkha Malay Joshi M.E. Computer Science and Engineering, Parul Institute Of Engineering & Technology, Waghodia. India Email:
More informationTrees and Random Forests
Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG03739201 Cache Valley, Utah Utah State University Leo
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More informationA New Method for Dimensionality Reduction using K Means Clustering Algorithm for High Dimensional Data Set
A New Method for Dimensionality Reduction using K Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationCharacter Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
More informationVisualization of Topology Representing Networks
Visualization of Topology Representing Networks Agnes VathyFogarassy 1, Agnes WernerStark 1, Balazs Gal 1 and Janos Abonyi 2 1 University of Pannonia, Department of Mathematics and Computing, P.O.Box
More informationHighPerformance Signature Recognition Method using SVM
HighPerformance Signature Recognition Method using SVM Saeid Fazli Research Institute of Modern Biological Techniques University of Zanjan Shima Pouyan Electrical Engineering Department University of
More informationKnearestneighbor: an introduction to machine learning
Knearestneighbor: an introduction to machine learning Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Outline Types of learning Classification:
More informationEvolutionary Tuning of Combined Multiple Models
Evolutionary Tuning of Combined Multiple Models Gregor Stiglic, Peter Kokol Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia {Gregor.Stiglic, Kokol}@unimb.si
More informationTDA and Machine Learning: Better Together
TDA and Machine Learning: Better Together TDA AND MACHINE LEARNING: BETTER TOGETHER 2 TABLE OF CONTENTS The New Data Analytics Dilemma... 3 Introducing Topology and Topological Data Analysis... 3 The Promise
More informationData Mining Fundamentals
Part I Data Mining Fundamentals Data Mining: A First View Chapter 1 1.11 Data Mining: A Definition Data Mining The process of employing one or more computer learning techniques to automatically analyze
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationDEVELOPING AN IMAGE RECOGNITION ALGORITHM FOR FACIAL AND DIGIT IDENTIFICATION
DEVELOPING AN IMAGE RECOGNITION ALGORITHM FOR FACIAL AND DIGIT IDENTIFICATION ABSTRACT Christian Cosgrove, Kelly Li, Rebecca Lin, Shree Nadkarni, Samanvit Vijapur, Priscilla Wong, Yanjun Yang, Kate Yuan,
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More information