Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data


CMPE 59H Term Project Report

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Fatma Güney, Kübra Kalkan
1/15/2013

Keywords: Non-linear Dimensionality Reduction, Principal Components Analysis, Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Support Vector Machines, Nearest Neighbor Classification, Leave-one-out Cross Validation, k-fold Cross Validation, Cancer Microarray Data.

Table of Contents

Introduction
Methodology
  Dimensionality Reduction Techniques
    Isomap
    Locally Linear Embedding
    Laplacian Eigenmaps
  Classification Techniques
    Nearest Neighbor Classification
    Support Vector Machines
Experiments and Results
  Datasets
  Experimental Setup
  Results
Visualizations
Conclusion

Introduction

One particular property of microarray data is that the number of variables is much larger than the number of samples. Another is that the correlations between variables are complex and unknown, which makes the direct application of machine learning algorithms to the data harder: there is always a high risk of singularity and overfitting. Researchers have developed various methods to overcome these problems on microarray data. They either select combinations of genes based on some strategy, which is called gene selection, or learn the underlying structure of the data and project it into a lower-dimensional and generally more discriminative space using dimensionality reduction techniques. In this project, we compare the results of a set of dimensionality reduction techniques for the classification of gene expression microarray data. Classical dimensionality reduction techniques such as Principal Components Analysis (PCA) have proven successful in previous studies. In this study, we compare these techniques against a set of non-linear dimensionality reduction techniques: Isomap, Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). After dimensionality reduction, classification methods can be applied to the projected data in the low-dimensional space. We employ simple nearest neighbor classification and Support Vector Machines (SVMs) to classify the projected data. Prior to classification, we perform two types of cross validation to optimize the parameters of the dimensionality reduction techniques: k-fold cross validation in the case of nearest neighbor, and leave-one-out cross validation in the case of SVMs. We present results on six different cancer microarray datasets: AML Prognosis, Breast and Colon Cancer, Lymphoma (DLBCL vs. FL), Leukemia, Prostate Cancer, and Colon Cancer.

Methodology

Dimensionality Reduction Techniques

In a classification problem, the complexity of the algorithm and the number of samples necessary to train the classifier depend heavily on the number of variables. Beyond the decrease in complexity, reducing dimensionality has many other advantages. A smaller input dimensionality leads to a simpler model that is more robust against variance in the data caused by noise, outliers, etc. As in the case of microarray data, a smaller dimensionality also enables visualization by projecting the data into a lower space, namely 2D or 3D.

Dimensionality reduction can be performed as either feature selection or feature extraction. In feature selection, a subset of relevant features is selected and used for classification or any other purpose. In feature extraction, the data is projected into a lower-dimensional subspace whose dimensions are combinations of the original ones. Feature extraction methods can be categorized as linear or non-linear, and as supervised or unsupervised. Principal Components Analysis (PCA), Factor Analysis (FA), and Multidimensional Scaling (MDS) are among the best known linear, unsupervised techniques. Linear Discriminant Analysis (LDA) is another linear method, but a supervised one, since it uses output information. In this project, we study non-linear, unsupervised dimensionality reduction techniques: Isomap (IM), Locally Linear Embedding (LLE), and Laplacian Eigenmaps (LEM). In this section, we explain these techniques, and also MDS in detail, since IM is basically a non-linear modification of MDS.

Multidimensional Scaling (MDS)

In MDS, d_ij is defined as the distance between points x_i and x_j, and the distances between all pairs of points are given in advance. MDS projects the data into a lower-dimensional space while preserving these distances. For example, given road distances between cities, the result of MDS approximates a map containing these cities.
As in many dimensionality reduction techniques, MDS can be formulated as an optimization problem: it finds a mapping from d dimensions to k dimensions that minimizes the error, defined as the sum over all pairs of points of the discrepancy between their distances in the two spaces:

E(θ) = Σ_{i<j} ( d_ij − || z_i − z_j || )²,   where z_i = g(x_i | θ)

Here g(x | θ) is the mapping function with parameter set θ. It can be a linear transformation, as in z = W^T x, or a non-linear mapping, in which case the method is called Sammon mapping.

Isomap (IM)

Isomap uses the Euclidean distance for close neighboring points and estimates geodesic distances for far away points. Geodesic distance is defined on a graph whose nodes correspond to data points and whose edges connect neighboring data points. The neighborhood can be defined by a threshold ε on the distance or by the k nearest neighbors. The distance d(i, j) between any two nodes i and j is then defined as the length of the shortest path between them in the graph; this is the geodesic distance. Isomap then applies MDS to these geodesic distances to compute the reduced-dimensional positions of all the points.

Locally Linear Embedding (LLE)

LLE recovers global non-linear structure from locally linear fits. The idea behind LLE is to represent each local patch of the manifold linearly, by writing each point as a weighted linear sum of its neighbors. Given a point x_i and its neighbors x_j, the aim is to find reconstruction weights W_ij that minimize the error function

E(W) = Σ_i || x_i − Σ_j W_ij x_j ||²

subject to Σ_j W_ij = 1. The weight matrix W reflects the intrinsic geometric properties of the data that we want to preserve in the new space: nearby points in the original space should remain nearby. With the weights fixed, the low-dimensional coordinates y_i are then found by minimizing

E(y) = Σ_i || y_i − Σ_j W_ij y_j ||²
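As a concrete illustration, the two stages of Isomap (geodesic distances over a neighborhood graph, followed by MDS) can be sketched in a few lines of NumPy. This is a minimal sketch for intuition, not the implementation used in this project; the function names and the toy quarter-circle data are ours.

```python
import numpy as np

def classical_mds(D, k):
    """Embed points in k dims so pairwise distances approximate D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)       # eigh returns ascending order
    order = np.argsort(vals)[::-1][:k]   # take the top-k eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

def isomap(X, n_neighbors, k):
    """Geodesic distances over the neighborhood graph, then MDS."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]          # Euclidean distance to neighbors
        G[nbrs, i] = D[nbrs, i]          # keep the graph symmetric
    np.fill_diagonal(G, 0.0)
    for m in range(n):                   # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return classical_mds(G, k)

# Points along a quarter circle: the 1-D embedding should order them
# by arc length, which the geodesic distances approximate.
t = np.linspace(0, np.pi / 2, 8)
X = np.c_[np.cos(t), np.sin(t)]
Z = isomap(X, n_neighbors=2, k=1)
```

The eigenvector sign is arbitrary, so the recovered coordinate may be increasing or decreasing along the curve; either way it unrolls the manifold into one dimension.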

Laplacian Eigenmaps (LEM)

As in LLE and IM, LEM builds an adjacency graph of neighboring nodes. Similarly, the neighborhood can be defined by a distance threshold ε or by the k nearest neighbors. The weights of the edges can be binary (1 if connected, 0 otherwise), or they can be defined by the heat kernel:

W_ij = exp( −|| x_i − x_j ||² / t )

LEM minimizes an error function that ensures points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances:

E(y) = Σ_{i,j} || y_i − y_j ||² W_ij

The minimization is formulated as the construction of eigenmaps: the algorithm computes the eigenvalues and eigenvectors of the Laplacian matrix L = D − W, where D is the diagonal degree matrix constructed from the weight matrix W as D_ii = Σ_j W_ij.

Classification Techniques

Various techniques exist for classifying new data with the help of existing data. In this project, we use the nearest neighbor and support vector machine classification algorithms to classify the microarray data.

Nearest Neighbor Classification

k-nearest neighbor is a method for classifying objects according to the closest training data [1]. A test point is classified by finding its k nearest training points and taking a majority vote among them: the test point is assigned the class held by the majority of the group. This is explained in detail in [2]. Suppose there are two classes, plus and minus, and we already have training data belonging to these classes. Our task is to classify a query point as plus or minus; this is depicted in Figure 4.

Figure 4. k-Nearest Neighbor Classification

If k equals one, the method is called 1-nearest neighbor, or simply nearest neighbor. In that case, the query point (shown in red in the figure) finds its closest point, which is a plus, and is classified as plus. If we increase k to 2, the query point can no longer be classified, since the second closest point is a minus: both classes get the same score. If k is increased to 5, the query point finds the circular region shown in the figure, which contains 2 plus and 3 minus points. Since the majority belongs to the minus class, the red point is classified as minus.

Support Vector Machines

The Support Vector Machine classification technique is based on the idea of decision planes [3]. A decision plane separates sets of objects that belong to different classes. An example is depicted in Figure 5. The plane contains two classes of objects, colored red and green, and the separating line is called the decision boundary. The objects to the right of this line belong to the green class, whereas the red objects lie to the left. Any new (white) object on the right side will be labeled green, and any on the left side will be labeled red.

Figure 5. Decision Plane and Linear Classifier

This is a classic linear classifier; however, most classification problems are not linearly separable. A more complex structure is needed to construct the optimal separation, as shown in Figure 6: to classify the red and green objects correctly, we need a curve rather than a simple straight line. Classifiers producing such separating surfaces are known as hyperplane classifiers, and Support Vector Machines were developed to deal with this type of task.

Figure 6. Hyperplane Classifier

The basic idea of Support Vector Machines is depicted in Figure 7. The original objects are mapped into a new space with the help of mathematical functions known as kernel functions; this mapping process is called a transformation. In the mapped version, classification is much easier, since a linear classifier suffices to separate the two classes: instead of constructing a complex curve, we only need to find an optimal line separating the green and red objects.

Figure 7. Transformation
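The majority-vote rule described above is easy to state in code. The sketch below is our own illustration (the function name, toy points, and labels are invented for this example), not the implementation used in the experiments:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by a majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority class wins

# Two toy clusters, in the spirit of the plus/minus example of Figure 4
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.], [6., 5.], [5., 6.]])
y = np.array(['plus', 'plus', 'plus', 'minus', 'minus', 'minus'])
knn_predict(X, y, np.array([0.5, 0.5]), k=3)   # 'plus'
```

With an even k a vote can tie, as in the k = 2 case discussed above; here `np.argmax` would simply pick the first tied label, which is one common way to break ties.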

Experiments and Results

Datasets

Dataset                 Classes               Samples     Dimensionality
AML Prognosis           Remission / Relapse   28 / 26     12625
Breast and Colon        Breast / Colon        31 / 21     —
DLBCL                   DLBCL / FL            58 / 19     —
Leukemia                ALL / AML             47 / 25     —
Prostate                Normal / Prostate     50 / 52     —
Colon Cancer (I2000)    Tumor / Normal        42 / —      —

AML Prognosis (GSE2191): The classification problem on this dataset is to distinguish patients with acute myeloid leukemia (AML) according to their prognosis after treatment (remission or relapse of the disease). Most patients enter complete remission after treatment, but a significant number of them relapse with resistant disease.

Breast and Colon (GSE3726): Predictive gene signatures can be measured with samples stored in RNAlater, which preserves RNA. This dataset contains a number of breast- or colon-specific genes that are predictive of relapse status. Frozen samples from the original dataset are used to distinguish between colon and breast cancer patients based on gene expression values.

DLBCL: Diffuse large B-cell lymphomas (DLBCL) and follicular lymphomas (FL) are two B-cell lineage malignancies with very different clinical presentations, natural histories, and responses to therapy. However, FLs frequently evolve over time and acquire the morphologic and clinical features of DLBCLs, and some subsets of DLBCLs have chromosomal translocations characteristic of FLs. The aim of the gene-expression-based classification is to distinguish between these two lymphomas.

Leukemia: This dataset contains gene-expression information for samples of human AML and acute lymphoblastic leukemia (ALL).

Prostate: This dataset contains gene expression measurements for samples of prostate tumors and adjacent prostate tissue not containing tumor.

Colon Cancer: This dataset contains the gene expressions with the highest minimal intensity across tumor and normal colon tissues.

Experimental Setup

IM, LLE, and LEM each have two parameters: the reduced dimensionality k and the number of neighboring nodes n. To optimize these parameters, we perform a grid search over the intervals suggested in the literature: for k we search from 2 to 15, and for n from 4 to 16. As in the example grid below (rows k, columns n), we obtain an accuracy for each combination, corresponding to one cell of the grid. After the grid is filled, we find the cell containing the maximum accuracy and use its row and column values as k and n, respectively. In this particular example, the maximum is obtained when n equals 4 and k equals either 10 or 11. If more than one cell contains the maximum, we take the minimum indices for performance reasons; according to this grid, we therefore select 10 for k and 4 for n.

[Example accuracy grid: rows k = 2..15, columns n = 4..16]
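The selection rule above (maximum accuracy, minimum indices on a tie) can be sketched as follows; the grid values here are invented solely to reproduce the example from the text:

```python
import numpy as np

def pick_parameters(acc, ks, ns):
    """Return the (k, n) of the highest-accuracy cell; on ties, take the
    smallest indices (np.argwhere scans the grid in row-major order, so
    the first hit already has the minimum row, then the minimum column)."""
    acc = np.asarray(acc)
    row, col = np.argwhere(acc == acc.max())[0]
    return ks[row], ns[col]

ks = list(range(2, 16))                # reduced dimensionality k: 2..15
ns = list(range(4, 17))                # number of neighbors n: 4..16
acc = np.zeros((len(ks), len(ns)))
acc[ks.index(10), ns.index(4)] = 0.9   # maximum at k=10, n=4 ...
acc[ks.index(11), ns.index(4)] = 0.9   # ... tied at k=11, n=4
pick_parameters(acc, ks, ns)           # (10, 4): the tie goes to smaller k
```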

Results

In our experiments, we use two different experimental setups to obtain the accuracies. In this section, we explain the details of these setups and present our results for each dataset.

The first setup is k-fold cross validation. For each cell of the grid above, we divide the data k times into train and test sets such that the number of test samples is always 10. For example, the AML dataset has 54 samples; for each parameter combination, we permute the samples and use 44 as training data and 10 as test data, giving 5 folds, i.e. 5-fold cross validation. Accuracy is reported as the average of the accuracies obtained over the folds. For this setup, we obtain results for PCA, IM, LLE, and LEM, using nearest neighbor as the classifier. Results are shown in the table below, with the highest result for each dataset highlighted in red. According to this table, the optimized values of the reduced dimensionality and the neighborhood vary widely across datasets, but are close for different techniques applied to the same dataset. For example, in the case of AML Prognosis, the reduced dimensionality is either 3 or 2 for every technique except IM. From this, we can infer that the data can be projected into a very small dimensionality compared to its original dimensionality (12625), and that accuracies around 60-70% can be achieved in that space. On the other hand, IM produces the best results on this dataset, with k equal to 9.

[Table: dimension, neighborhood, and accuracy (%) of PCA, IM, LLE, and LEM on each dataset under k-fold cross validation]

The non-linear methods beat PCA on almost all datasets. The accuracies obtained with PCA are approximately 10% lower than those of the other methods on some of the less separable datasets, including AML Prognosis and Colon. Other datasets, such as Breast & Colon and Leukemia, generally yield high accuracies for all of the methods; even on these, the three non-linear methods always achieve higher accuracy than PCA. This shows the importance of modeling non-linearity in microarray datasets. Comparing the non-linear techniques with each other, IM and LLE are clearly superior to LEM: on every dataset, the highest accuracy is achieved by either LLE or IM. In line with the literature, LEM produces the most varying results.

The second setup is leave-one-out cross validation. This time, we divide the data into train and test sets as many times as there are samples, so that each sample is used exactly once as the test data while the remaining samples are used as training data. Accuracy is reported as the average of the accuracies obtained over the runs. For this setup, we obtain results for IM, LLE, and LEM, using an SVM with a linear kernel as the classifier. Results are shown in the table below; the highest results, highlighted in red, are obtained with LLE for each dataset. Following LLE, IM and LEM produce similar results. Interestingly, LEM is able to reach the same high results as IM and LLE when an SVM is used for classification instead of nearest neighbor. Although it is not completely fair to compare these results with the previous table, since the experimental setups differ, we observe that the results of the second setup are higher for every dataset except Breast & Colon and Colon: results are worse than in the first setup on Breast & Colon, and almost the same on Colon.
[Table: dimensionality, neighborhood, and accuracy (%) of IM, LLE, and LEM on each dataset under leave-one-out cross validation]
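The leave-one-out procedure of the second setup can be sketched in a few lines. Here we pair it with a plain 1-nearest-neighbor classifier on invented toy data rather than the SVM-on-embeddings pipeline of the experiments; the function names are ours:

```python
import numpy as np

def nn1(X_train, y_train, x):
    """1-nearest-neighbor: the label of the closest training point."""
    return y_train[np.argmin(((X_train - x) ** 2).sum(axis=1))]

def leave_one_out_accuracy(X, y, classify):
    """Each sample plays the test set exactly once; all others train."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # hold out sample i
        hits += classify(X[mask], y[mask], X[i]) == y[i]
    return hits / len(X)

# Two well-separated toy clusters: every held-out point's nearest
# neighbor lies in its own cluster, so the accuracy is 1.0
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
leave_one_out_accuracy(X, y, nn1)   # 1.0
```

Leave-one-out is the extreme case of k-fold cross validation with k equal to the number of samples, which is why it is affordable here only because the datasets are small.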

Visualizations

In this section, we show projections of different datasets, one per dimensionality reduction technique, using the second experimental setup. In this setup, IM achieves its highest accuracy on the Leukemia dataset when the reduced dimensionality is 2. As can be seen from the figure below, the data is linearly separable in the 2D space. Similarly, the Breast & Colon dataset is linearly separable, and LLE achieves its highest accuracy on this dataset when the reduced dimensionality is 2. The results of LEM on the Colon dataset and of PCA on the Prostate dataset are also shown. In these cases, the data points are not clearly separable, confirming the results of the previous section. The result of PCA in particular is very messy. Since PCA is a linear method, it is only good at representing data lying on a linear subspace. These visualizations show that the microarray data we use has a structure more complex than linear, which PCA cannot capture when projecting the data into a very low-dimensional space. The non-linear methods, especially LLE and IM, preserve more information about the data, such as the locality that encodes neighborhood relationships. These local relationships constitute the intrinsic geometric properties of the data, which the non-linear methods are designed to recover in a lower dimensionality.

[Figures: IM on Leukemia; LLE on Breast and Colon; LEM on Colon; PCA on Prostate]

Conclusion

In this project, we compared four dimensionality reduction techniques, one linear and three non-linear, using two different classifiers. We optimized the parameters of the dimensionality reduction techniques by cross validation and presented our results under two different setups on six different datasets. We observed a significant decrease in accuracy with PCA compared to the non-linear methods. LLE and IM showed the best performance across setups and datasets, consistent with the literature. These two methods preserve the underlying structure of the data better than the other methods when the data is projected into a space of very small dimensionality compared to the original.

References

1) K-Nearest Neighbor Algorithm. Retrieved January 14, 2013, from
2) K-Nearest Neighbors. Retrieved January 14, 2013, from
3) Support Vector Machines. Retrieved January 14, 2013, from
4) Ethem Alpaydın. Introduction to Machine Learning, second edition. The MIT Press.


More information

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Clustering of Leukemia Patients via Gene Expression Data Analysis

Clustering of Leukemia Patients via Gene Expression Data Analysis University of New Orleans ScholarWorks@UNO University of New Orleans Theses and Dissertations Dissertations and Theses 12-15-26 Clustering of Leukemia Patients via Gene Expression Data Analysis Zhiyu Zhao

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Dimensionality Reduction A Short Tutorial

Dimensionality Reduction A Short Tutorial Dimensionality Reduction A Short Tutorial Ali Ghodsi Department of Statistics and Actuarial Science University of Waterloo Waterloo, Ontario, Canada, 2006 c Ali Ghodsi, 2006 Contents 1 An Introduction

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis

An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis Roberto Avogadri 1, Giorgio Valentini 1 1 DSI, Dipartimento di Scienze dell Informazione, Università degli Studi di Milano,Via

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

Music Classification by Composer

Music Classification by Composer Music Classification by Composer Janice Lan janlan@stanford.edu CS 229, Andrew Ng December 14, 2012 Armon Saied armons@stanford.edu Abstract Music classification by a computer has been an interesting subject

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of Bar-Ilan University Ramat-Gan,

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Selection of the Suitable Parameter Value for ISOMAP

Selection of the Suitable Parameter Value for ISOMAP 1034 JOURNAL OF SOFTWARE, VOL. 6, NO. 6, JUNE 2011 Selection of the Suitable Parameter Value for ISOMAP Li Jing and Chao Shao School of Computer and Information Engineering, Henan University of Economics

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

PCA to Eigenfaces. CS 510 Lecture #16 March 23 th A 9 dimensional PCA example

PCA to Eigenfaces. CS 510 Lecture #16 March 23 th A 9 dimensional PCA example PCA to Eigenfaces CS 510 Lecture #16 March 23 th 2015 A 9 dimensional PCA example is dark around the edges and bright in the middle. is light with dark vertical bars. is light with dark horizontal bars.

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006

Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006 Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Performance Metrics for Graph Mining Tasks

Performance Metrics for Graph Mining Tasks Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical

More information

HDDVis: An Interactive Tool for High Dimensional Data Visualization

HDDVis: An Interactive Tool for High Dimensional Data Visualization HDDVis: An Interactive Tool for High Dimensional Data Visualization Mingyue Tan Department of Computer Science University of British Columbia mtan@cs.ubc.ca ABSTRACT Current high dimensional data visualization

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Face Recognition using SIFT Features

Face Recognition using SIFT Features Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications

Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518 Human are Great Pattern Recognizers

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation Shanofer. S Master of Engineering, Department of Computer Science and Engineering, Veerammal Engineering College,

More information

ADVANCED MACHINE LEARNING. Introduction

ADVANCED MACHINE LEARNING. Introduction 1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures

More information

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM Ms.Barkha Malay Joshi M.E. Computer Science and Engineering, Parul Institute Of Engineering & Technology, Waghodia. India Email:

More information

Trees and Random Forests

Trees and Random Forests Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

More information

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

More information

A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set

A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Character Image Patterns as Big Data

Character Image Patterns as Big Data 22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,

More information

Visualization of Topology Representing Networks

Visualization of Topology Representing Networks Visualization of Topology Representing Networks Agnes Vathy-Fogarassy 1, Agnes Werner-Stark 1, Balazs Gal 1 and Janos Abonyi 2 1 University of Pannonia, Department of Mathematics and Computing, P.O.Box

More information

High-Performance Signature Recognition Method using SVM

High-Performance Signature Recognition Method using SVM High-Performance Signature Recognition Method using SVM Saeid Fazli Research Institute of Modern Biological Techniques University of Zanjan Shima Pouyan Electrical Engineering Department University of

More information

K-nearest-neighbor: an introduction to machine learning

K-nearest-neighbor: an introduction to machine learning K-nearest-neighbor: an introduction to machine learning Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1 Outline Types of learning Classification:

More information

Evolutionary Tuning of Combined Multiple Models

Evolutionary Tuning of Combined Multiple Models Evolutionary Tuning of Combined Multiple Models Gregor Stiglic, Peter Kokol Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia {Gregor.Stiglic, Kokol}@uni-mb.si

More information

TDA and Machine Learning: Better Together

TDA and Machine Learning: Better Together TDA and Machine Learning: Better Together TDA AND MACHINE LEARNING: BETTER TOGETHER 2 TABLE OF CONTENTS The New Data Analytics Dilemma... 3 Introducing Topology and Topological Data Analysis... 3 The Promise

More information

Data Mining Fundamentals

Data Mining Fundamentals Part I Data Mining Fundamentals Data Mining: A First View Chapter 1 1.11 Data Mining: A Definition Data Mining The process of employing one or more computer learning techniques to automatically analyze

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

DEVELOPING AN IMAGE RECOGNITION ALGORITHM FOR FACIAL AND DIGIT IDENTIFICATION

DEVELOPING AN IMAGE RECOGNITION ALGORITHM FOR FACIAL AND DIGIT IDENTIFICATION DEVELOPING AN IMAGE RECOGNITION ALGORITHM FOR FACIAL AND DIGIT IDENTIFICATION ABSTRACT Christian Cosgrove, Kelly Li, Rebecca Lin, Shree Nadkarni, Samanvit Vijapur, Priscilla Wong, Yanjun Yang, Kate Yuan,

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R. 5. Link Analysis Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

More information