Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Term Project Report
Fatma Güney, Kübra Kalkan
January 15, 2013

Keywords: Non-linear Dimensionality Reduction, Principal Components Analysis, Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Support Vector Machines, Nearest Neighbor Classification, Leave-one-out Cross Validation, k-fold Cross Validation, Cancer Microarray Data.
Table of Contents

Introduction
Methodology
    Dimensionality Reduction Techniques
        Isomap
        Locally Linear Embedding
        Laplacian Eigenmaps
    Classification Techniques
        Nearest Neighbor Classification
        Support Vector Machines
Experiments and Results
    Datasets
    Experimental Setup
    Results
    Visualizations
Conclusion
Introduction

One particular property of microarray data is that the number of variables is much larger than the number of samples. Another is that the correlations between variables are complex and largely unknown, which makes the direct application of machine learning algorithms to the data difficult: there is always a high risk of singularity and overfitting. Researchers have developed various methods to overcome these problems. They either select combinations of genes based on some strategy, which is called gene selection, or learn the underlying structure of the data and project it into a lower-dimensional and generally more discriminative space using dimensionality reduction techniques. In this project, we compare a set of dimensionality reduction techniques for the classification of gene expression microarray data. Classical linear techniques such as Principal Components Analysis (PCA) have proven successful in previous studies. In this study, we compare these techniques with a set of non-linear dimensionality reduction techniques: Isomap, Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). After dimensionality reduction, classification methods can be applied to the projected data in the low-dimensional space. We employ simple nearest neighbor classification and Support Vector Machines (SVMs) for classification of the projected data. Prior to classification, we perform two types of cross validation to optimize the parameters of the dimensionality reduction techniques: k-fold cross validation in the case of nearest neighbor and leave-one-out cross validation in the case of SVMs. We present results on six different cancer microarray datasets: AML Prognosis, Breast and Colon Cancer, Lymphoma (DLBCL vs. FL), Leukemia, Prostate Cancer, and Colon Cancer.
Methodology

Dimensionality Reduction Techniques

In a classification problem, the complexity of the algorithm and the number of samples necessary to train the classifier depend strongly on the number of variables. Beyond the decrease in complexity, reducing dimensionality has many other advantages. A smaller input dimensionality leads to a simpler model that is robust against variance in the data caused by noise, outliers, etc. As in the case of microarray data, a smaller dimensionality also enables visualization by projecting the data into a lower space, namely 2D or 3D.

Dimensionality reduction can be performed as either feature selection or feature extraction. In feature selection, a number of relevant features are selected and used for classification or for other purposes. In feature extraction, the data is projected into a lower-dimensional subspace whose dimensions are combinations of the original ones. Feature extraction methods can be categorized as linear or non-linear, and supervised or unsupervised. Principal Components Analysis (PCA), Factor Analysis (FA), and Multidimensional Scaling (MDS) are among the best known linear, unsupervised techniques. Linear Discriminant Analysis (LDA) is a linear method that is supervised, since it uses the output information. In this project, we study non-linear, unsupervised dimensionality reduction techniques: Isomap (IM), Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). In this section, we explain these techniques, and also MDS, in detail, since Isomap is basically a non-linear modification of MDS.

Multidimensional Scaling (MDS)

In MDS, d_rs is defined as the distance between points x_r and x_s, and the distances between all pairs of points are given in advance. MDS projects the data into a lower-dimensional space while preserving these distances. For example, given the road distances between cities, the result of MDS approximates a map containing these cities.
As with many dimensionality reduction techniques, MDS can be formulated as an optimization problem. MDS finds a mapping from d dimensions to k dimensions, and the aim of the mapping is to minimize the error, taken as the sum over all pairs of points of the discrepancy between their distances in the two spaces:

    E(θ | X) = Σ_{r,s} ( ‖g(x_r | θ) − g(x_s | θ)‖ − d_rs )²

The mapping is defined by z = g(x | θ), where g is the mapping function with parameter set θ. It can be a linear transformation, as in z = Wᵀx, or it can be a non-linear mapping, in which case the method (with each error term normalized by d_rs²) is called Sammon mapping.

Isomap (IM)

Isomap uses Euclidean distance for close neighboring points and estimates geodesic distances for far-away points. Geodesic distance is defined on a graph whose nodes correspond to data points and whose edges connect neighboring data points. The neighborhood can be defined by a threshold on the distance or as the n nearest neighbors. The distance between any two nodes is then defined as the length of the shortest path between them in the graph; this is the geodesic distance. Isomap then uses MDS on these geodesic distances to compute the reduced-dimensional positions of all the points.

Locally Linear Embedding (LLE)

LLE recovers global non-linear structure from locally linear fits. The idea behind LLE is to represent each local patch of the manifold linearly, by writing each point as a linearly weighted sum of its neighbors. Given a point x_i and its neighbors x_j, the aim is to find reconstruction weights W_ij that minimize the error function

    E(W) = Σ_i ‖ x_i − Σ_j W_ij x_j ‖²

W reflects the intrinsic geometric properties of the data that we want to preserve in the new space: nearby points in the original space should remain nearby. In the low-dimensional space, LLE then finds the coordinates y_i that minimize

    Φ(Y) = Σ_i ‖ y_i − Σ_j W_ij y_j ‖²
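As a concrete illustration of the Isomap and LLE techniques described above, both are available in scikit-learn; this is a sketch of this rewrite, not the report's own code, and the swiss-roll data is a synthetic stand-in for microarray samples.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# Synthetic 3-D manifold data standing in for high-dimensional samples.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# n_neighbors plays the role of n and n_components the role of k in the report.
Y = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                           random_state=0).fit_transform(X)
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(Y.shape, Z.shape)  # (500, 2) (500, 2)
```

Both calls return one low-dimensional coordinate vector per input sample, which can then be fed to a classifier.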
Laplacian Eigenmaps (LEM)

As in LLE and Isomap, LEM builds an adjacency graph of neighboring nodes. The neighborhood can again be defined by a distance threshold or as the n nearest neighbors. The edge weights can be binary (1 if connected, 0 otherwise) or they can be defined by the heat kernel:

    W_ij = exp( −‖x_i − x_j‖² / t )

LEM minimizes an error function that ensures points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances:

    Σ_{ij} ‖ y_i − y_j ‖² W_ij

The minimization is formulated as the construction of eigenmaps: the algorithm computes the eigenvalues and eigenvectors of the graph Laplacian L = D − W, where D is the diagonal degree matrix constructed from the weight matrix W as D_ii = Σ_j W_ij.

Classification Techniques

Various techniques exist for classifying new data with the help of existing data. In this project we use the nearest neighbor and support vector machine classification algorithms to classify the microarray data.

Nearest Neighbor Classification

k-nearest neighbor is a method for classifying objects according to the closest training data [1]. A test point is classified by finding its k nearest training points and taking the majority class among them; this is explained in detail in [2]. Suppose there are two classes, plus and minus, and we already have some training data belonging to these classes. Our task is to classify a query point as plus or minus. This is depicted in Figure 4.

Figure 4. k-nearest neighbor classification

If k equals one, this is called 1-nearest neighbor, or just nearest neighbor. In that case, the query point (shown in red in the figure) finds its closest point, which is a plus, and is classified as plus. If we increase k to 2, the point cannot be classified, since the second closest point is a minus and both classes get the same score. If k is 5, the method considers the circled region in the figure, which contains 2 plus and 3 minus points; as the majority belongs to the minus class, the red point is classified as minus.

Support Vector Machines

The Support Vector Machine classification technique is based on the idea of decision planes [3]. A decision plane separates sets of objects belonging to different classes. An example of a decision plane is depicted in Figure 5. It separates two classes of objects, colored red and green; the separating line is called the decision boundary. Objects to the right of this line belong to the green class, while objects to the left belong to the red class. Any new (white) object falling on the right side will be labeled green, and any falling on the left side will be labeled red.

Figure 5. Decision plane and linear classifier

This is a classic linear classifier; however, a linear boundary is not sufficient in most classification problems. A more complex structure is needed to construct the optimal separation, as shown in Figure 6: to classify the red and green objects correctly, we would need a curve rather than a simple line. Classifiers producing such separating surfaces are known as hyperplane classifiers, and Support Vector Machines are designed to handle exactly such tasks.

Figure 6. Hyperplane classifier

The basic idea of Support Vector Machines is depicted in Figure 7. The original objects are mapped into a new space by mathematical functions known as kernel functions; this mapping process is called a transformation. In the mapped version, classification is much easier, since a linear classifier suffices to separate the two classes: instead of constructing a complex curve, we only need to find an optimal line separating the green and red objects.

Figure 7. Transformation
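The two classifiers described above can be sketched with scikit-learn as follows; the data and the parameter values (5 neighbors, linear kernel) are illustrative stand-ins, not the report's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic two-class data standing in for a reduced-dimensional dataset.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, random_state=0)

# k-NN: majority vote among the 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# SVM with a linear kernel: a single separating hyperplane.
svm = SVC(kernel="linear").fit(X_tr, y_tr)

knn_acc = knn.score(X_te, y_te)
svm_acc = svm.score(X_te, y_te)
print(knn_acc, svm_acc)
```

Both `score` calls return the accuracy on the held-out 10 test samples, the quantity averaged over folds in the experiments below.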
Experiments and Results

Datasets

    Dataset                 Classes (samples)           Dimensionality
    AML Prognosis           Remission 28 / Relapse 26   12625
    Breast and Colon        Breast 31 / Colon 21
    DLBCL                   DLBCL 58 / FL 19
    Leukemia                ALL 47 / AML 25
    Prostate                Normal 50 / Prostate 52
    Colon Cancer (I2000)    Tumor 42 / Normal

(Other dimensionality values and the Normal sample count for Colon were not preserved in this transcription.)

AML Prognosis (GSE2191): The classification problem on this dataset is to distinguish patients with acute myeloid leukemia (AML) according to their prognosis after treatment (remission or relapse of disease). Most patients enter complete remission after treatment, but a significant number of them experience relapse with resistant disease.

Breast and Colon (GSE3726): Predictive gene signatures can be measured from samples stored in RNAlater, which preserves RNA. This dataset contains a number of breast- or colon-specific genes that are predictive of relapse status. Frozen samples from the original dataset are used to distinguish between colon and breast cancer patients based on gene expression values.

DLBCL: Diffuse large B-cell lymphoma (DLBCL) and follicular lymphoma (FL) are two B-cell lineage malignancies with very different clinical presentations, natural histories, and responses to therapy. However, FLs frequently evolve over time and acquire the morphologic and clinical features of DLBCLs, and some subsets of DLBCLs have chromosomal translocations characteristic of FLs. The aim of the gene-expression-based classification is to distinguish between these two lymphomas.

Leukemia: This dataset contains gene-expression measurements for samples from human AML and acute lymphoblastic leukemia (ALL) patients.

Prostate: This dataset contains gene expression measurements for samples of prostate tumors and adjacent prostate tissue not containing tumor.

Colon Cancer: This dataset contains the gene expressions with the highest minimal intensity across tumor and normal colon tissues.
Experimental Setup

IM, LLE, and LEM each have two parameters: the reduced dimensionality k and the number of neighboring nodes n. To optimize these parameters, we perform a grid search over intervals suggested in the literature: for k, we search from 2 to 15, and for n, from 4 to 16. We obtain an accuracy for each combination, corresponding to one cell of a grid. After the table is filled, we find the cell containing the maximum accuracy and use its row and column values as the k and n values, respectively. In the example grid, the maximum value is obtained when n equals 4 and k equals either 10 or 11. If more than one cell contains the maximum, we take the minimum indices for performance reasons; according to this table, we therefore select 10 for k and 4 for n.

[Example accuracy grid with rows k = 2..15 and columns n = 4..16; the cell values were not preserved in this transcription.]
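The grid search and the minimum-index tie-breaking rule described above can be sketched as follows. The `evaluate` function is a hypothetical stand-in for running one dimensionality reduction plus cross-validated classification at a given (k, n); the toy lambda below is for demonstration only.

```python
import numpy as np

def grid_search(evaluate, ks=range(2, 16), ns=range(4, 17)):
    """Fill the accuracy grid and return (k, n, accuracy) at its maximum."""
    acc = np.array([[evaluate(k, n) for n in ns] for k in ks])
    # np.argmax returns the FIRST maximum in row-major order, which
    # implements the "take the minimum indices" tie-breaking rule.
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return list(ks)[i], list(ns)[j], acc[i, j]

# Toy evaluate function peaking at k=10, n=4.
best_k, best_n, best_acc = grid_search(
    lambda k, n: 1.0 / (1 + abs(k - 10) + abs(n - 4)))
print(best_k, best_n)  # 10 4
```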
Results

In our experiments, we use two different experimental setups to obtain the accuracies. In this section, we explain the details of these setups and present our results for each dataset.

The first setup is k-fold cross validation. For each cell of the grid above, we divide the data k times into train and test sets such that the number of test samples is always 10. For example, there are 54 samples in the AML dataset: for each parameter combination, we permute the samples and use 44 as training data and 10 as test data. In that case there are 5 folds, i.e., 5-fold cross validation. Accuracy is reported as the average of the accuracies obtained over the folds. For this setup, we obtain results for PCA, IM, LLE, and LEM, using nearest neighbor as the classifier. Results are shown in the table below, with the highest result for each dataset highlighted in red. According to this table, the optimized values of the reduced dimensionality and neighborhood vary considerably between datasets, but are close for different techniques applied to the same dataset. For example, in the case of AML Prognosis, the reduced dimensionality is either 3 or 2 for all techniques except IM. From this we can infer that the data can be projected into a very small dimensionality compared to its original dimensionality (12625), and accuracies around 60-70% can be achieved in that space. On the other hand, IM produces the best results on this dataset, with k equal to 9.

[Table: dimension, neighborhood, and accuracy (%) for PCA, IM, LLE, and LEM on AML Prognosis, Breast&Colon, DLBCL, Leukemia, Prostate, and Colon; the numeric values were not preserved in this transcription.]

The non-linear methods beat PCA on almost all datasets. Accuracies obtained with PCA are approximately 10% lower than those of the other methods on some of the less separable datasets, including AML Prognosis and Colon. Other datasets, such as Breast & Colon and Leukemia, yield high accuracies for all of the methods used; even on these, the three non-linear methods always achieve higher accuracy than PCA. This shows the importance of modeling non-linearity in microarray datasets. Comparing the non-linear techniques with each other, IM and LLE are clearly superior to LEM: for all of the datasets, the highest accuracy is achieved by either LLE or IM, and, consistent with the literature, LEM produces the most varying results.

The second setup is leave-one-out cross validation. This time, we divide the data into train and test sets as many times as there are samples, so that each sample is used once as the test data while the remaining samples are used for training. Accuracy is reported as the average of the accuracies obtained over these runs. For this setup, we obtain results for IM, LLE, and LEM, using an SVM with a linear kernel as the classifier. Results are shown in the table below. The highest results, highlighted in red, are obtained with LLE for each dataset; following LLE, IM and LEM produce similar results. Interestingly, LEM is able to reach the same high results as IM and LLE when an SVM is used for classification instead of nearest neighbor. Although it is not entirely fair to compare these results with the previous table, since the experimental setups differ, we observe that the results of the second setup are higher for every dataset except Breast & Colon and Colon: results are worse than in the first setup on Breast & Colon, and almost the same on Colon.
[Table: dimensionality, neighborhood, and accuracy (%) for IM, LLE, and LEM on AML Prognosis, Breast&Colon, DLBCL, Leukemia, Prostate, and Colon under the leave-one-out setup; the numeric values were not preserved in this transcription.]
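The leave-one-out protocol of the second setup, with a linear-kernel SVM, can be sketched in scikit-learn as follows; the synthetic data is a stand-in for a reduced-dimensional microarray dataset, not one of the report's datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: 50 samples in a 5-dimensional reduced space.
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

# Leave-one-out: each sample is held out once as the test set;
# reported accuracy is the mean over all runs.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut())
loo_acc = scores.mean()
print(len(scores), loo_acc)
```

With 50 samples, `cross_val_score` performs 50 train/test splits, one per held-out sample.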
Visualizations

In this section, we project the different datasets using each dimensionality reduction technique once, under the second experimental setup. In this setup, IM achieves its highest accuracy on the Leukemia dataset when the reduced dimensionality is 2; as can be seen from the figure below, the data is linearly separable in the 2D space. Similarly, the Breast & Colon dataset is linearly separable, and LLE achieves its highest accuracy on this dataset when the reduced dimensionality is 2. The results of LEM on the Colon dataset and of PCA on the Prostate dataset are shown as well. In these cases, the data points are not clearly separable, confirming the results from the previous section; the result of PCA in particular is very messy. PCA is only good at representing data lying on a linear subspace, since it is a linear method. These visualizations show, however, that the microarray data we use has a more complex than linear structure, which cannot be captured by PCA projecting the data into a very low-dimensional space. The non-linear methods, especially LLE and IM, preserve more information about the data, such as the locality encoding neighborhood relationships. These local relationships constitute the intrinsic geometric properties of the data that non-linear methods are designed to recover in a lower dimensionality.

Figures: LLE on Breast and Colon dataset; LEM on Colon dataset; IM on Leukemia; PCA on Prostate.
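A 2D projection figure of the kind shown above can be produced as follows; this is a hedged sketch using scikit-learn and synthetic data, and the file name `isomap_2d.png` is an arbitrary choice of this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save the figure without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap

# Synthetic stand-in: 72 high-dimensional samples from two classes.
X, y = make_classification(n_samples=72, n_features=50, n_informative=5,
                           random_state=0)

# Project to 2D with Isomap, as done for the Leukemia visualization.
Y = Isomap(n_neighbors=5, n_components=2).fit_transform(X)

plt.scatter(Y[:, 0], Y[:, 1], c=y)  # color each point by its class label
plt.title("IM projection to 2D")
plt.savefig("isomap_2d.png")
print(Y.shape)  # (72, 2)
```

Class separability (or its absence) can then be judged directly from the scatter plot, as in the figures above.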
Conclusion

In this project, we compared four different dimensionality reduction techniques, one linear and three non-linear, using two different classifiers. We optimized the parameters of the dimensionality reduction techniques by cross validation and presented our results under two different setups for six different datasets. We observed a significant decrease in accuracy with PCA compared to the non-linear methods. LLE and IM showed the best performance across setups and datasets, in a manner consistent with the literature: these two methods preserve the underlying structure of the data better than the other methods when the data is projected into a space of very small dimensionality compared to the original.
References

[1] K-Nearest Neighbor Algorithm. Retrieved January 14, 2013.
[2] K-Nearest Neighbors. Retrieved January 14, 2013.
[3] Support Vector Machines. Retrieved January 14, 2013.
[4] Ethem Alpaydın. Introduction to Machine Learning, second edition. The MIT Press.
More informationAnalecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
More informationMHI3000 Big Data Analytics for Health Care Final Project Report
MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationCS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationPixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006
Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationAn unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis
An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis Roberto Avogadri 1, Giorgio Valentini 1 1 DSI, Dipartimento di Scienze dell Informazione, Università degli Studi di Milano,Via
More informationAnalysis of gene expression data. Ulf Leser and Philippe Thomas
Analysis of gene expression data Ulf Leser and Philippe Thomas This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser:
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationCluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationEmployer Health Insurance Premium Prediction Elliott Lui
Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than
More informationW6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationSupporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
More informationAn Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation
An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,
More informationADVANCED MACHINE LEARNING. Introduction
1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures
More informationA New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set
A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationVisualization of Topology Representing Networks
Visualization of Topology Representing Networks Agnes Vathy-Fogarassy 1, Agnes Werner-Stark 1, Balazs Gal 1 and Janos Abonyi 2 1 University of Pannonia, Department of Mathematics and Computing, P.O.Box
More informationInternational Journal of Software and Web Sciences (IJSWS) www.iasir.net
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International
More informationTDA and Machine Learning: Better Together
TDA and Machine Learning: Better Together TDA AND MACHINE LEARNING: BETTER TOGETHER 2 TABLE OF CONTENTS The New Data Analytics Dilemma... 3 Introducing Topology and Topological Data Analysis... 3 The Promise
More informationCross-validation for detecting and preventing overfitting
Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.
More informationEvolutionary Tuning of Combined Multiple Models
Evolutionary Tuning of Combined Multiple Models Gregor Stiglic, Peter Kokol Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia {Gregor.Stiglic, Kokol}@uni-mb.si
More information3D Model based Object Class Detection in An Arbitrary View
3D Model based Object Class Detection in An Arbitrary View Pingkun Yan, Saad M. Khan, Mubarak Shah School of Electrical Engineering and Computer Science University of Central Florida http://www.eecs.ucf.edu/
More informationHigh-Performance Signature Recognition Method using SVM
High-Performance Signature Recognition Method using SVM Saeid Fazli Research Institute of Modern Biological Techniques University of Zanjan Shima Pouyan Electrical Engineering Department University of
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
More informationData Mining: Overview. What is Data Mining?
Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,
More informationdiagnosis through Random
Convegno Calcolo ad Alte Prestazioni "Biocomputing" Bio-molecular diagnosis through Random Subspace Ensembles of Learning Machines Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini DSI Dipartimento
More informationExploratory Data Analysis with MATLAB
Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton
More informationA Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks
A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationHIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM
HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM Ms.Barkha Malay Joshi M.E. Computer Science and Engineering, Parul Institute Of Engineering & Technology, Waghodia. India Email:
More informationA fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More information