Feature Selection vs. Extraction

Similar documents

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Environmental Remote Sensing GEOG 2021

Biometric Authentication using Online Signatures

Linear Threshold Units

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Unsupervised and supervised dimension reduction: Algorithms and connections

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Similarity-Based Unsupervised Band Selection for Hyperspectral Image Analysis

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Component Ordering in Independent Component Analysis Based on Data Power

Statistical Machine Learning

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Sub-class Error-Correcting Output Codes

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Adaptive Face Recognition System from Myanmar NRC Card

ADVANCED MACHINE LEARNING. Introduction

A Survey on Pre-processing and Post-processing Techniques in Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Social Media Mining. Data Mining Essentials

Using multiple models: Bagging, Boosting, Ensembles, Forests

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Determining optimal window size for texture feature extraction methods

D-optimal plans in observational studies

Machine Learning and Pattern Recognition Logistic Regression

Predictive Modeling Techniques in Insurance

CITY UNIVERSITY OF HONG KONG 香港城市大學. Self-Organizing Map: Visualization and Data Handling 自組織神經網絡 : 可視化和數據處理

High-Dimensional Data Visualization by PCA and LDA

The Artificial Prediction Market

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Spam Detection A Machine Learning Approach

Accurate and robust image superresolution by neural processing of local image representations

Data Preprocessing. Week 2

The Scientific Data Mining Process

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

Unsupervised learning: Clustering

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Final Project Report

Visualization by Linear Projections as Information Retrieval

TOWARD BIG DATA ANALYSIS WORKSHOP

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Machine Learning for Data Science (CS4786) Lecture 1

Neural Networks Lesson 5 - Cluster Analysis

A Review of Missing Data Treatment Methods

Decision Trees from large Databases: SLIQ

A Content based Spam Filtering Using Optical Back Propagation Technique

AdaBoost for Learning Binary and Multiclass Discriminations. (set to the music of Perl scripts) Avinash Kak Purdue University. June 8, :20 Noon

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Chapter ML:XI (continued)

Face detection is a process of localizing and extracting the face region from the

Tracking and Recognition in Sports Videos

Predict Influencers in the Social Network

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams

Unsupervised Data Mining (Clustering)

Machine Learning using MapReduce

Synthetic Aperture Radar: Principles and Applications of AI in Automatic Target Recognition

Evaluation & Validation: Credibility: Evaluating what has been learned

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Categorical Data Visualization and Clustering Using Subjective Factors

Data, Measurements, Features

CS Introduction to Data Mining Instructor: Abdullah Mueen

Dimensionality Reduction: Principal Components Analysis

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

Data quality in Accounting Information Systems

Distance Metric Learning for Large Margin Nearest Neighbor Classification

Chapter 12 Discovering New Knowledge Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Supervised Learning (Big Data Analytics)

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Towards better accuracy for Spam predictions

Classification Techniques for Remote Sensing

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Binary Logistic Regression

Mahalanobis Distance Map Approach for Anomaly Detection

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY Principal Components Null Space Analysis for Image and Video Classification

Independent component ordering in ICA time series analysis

Character Image Patterns as Big Data

Feature Selection for Stock Market Analysis

Why do statisticians "hate" us?

Data Mining Algorithms Part 1. Dejan Sarka

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Data Visualization and Feature Selection: New Algorithms for Nongaussian Data

Transcription:

Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small number of discriminative features To avoid curse of dimensionality To reduce feature measurement cost To reduce computational burden Given an n x d pattern matrix (n patterns in d- dimensional space), generate an n x m pattern matrix, where m << d

Feature Selection vs. Extraction Both are collectively known as dimensionality reduction Selection: choose the best subset of size m from the available d features Extraction: given d features (set Y), extract m new features (set X) by linear or non-linear combination of all the d features Linear feature extraction: X = TY, where T is a m x d matrix Non-linear feature extraction: X = f(y) New extracted features may not have physical interpretation or meaning Examples of linear feature extraction PCA (unsupervised); LDA (supervised) Goals for feature selection & extraction: simplify classifier complexity and improve the classification accuracy,

Feature Selection How to find the best subset of size m? Recall, best means a classifier based on these m features has the lowest probability of error Simplest approach an exhaustive search: (i) select a criterion fn., (ii) evaluate ALL possible subsets Computationally prohibitive For d=24 and m=12, there are about 2.7 million possible feature subsets! Cover & Van Campenhout (IEEE SMC, 1977) showed that to guarantee the best subset of size m from the available set of size d, one must examine all possible subsets of size m Heuristics have been used to avoid exhaustive search How to evaluate the subsets? Error rate; but then which classifier should be used? Distance measure; Mahalanobis, divergence, Feature selection is an optimization problem

Feature Selection: Evaluation, Application, and Small Sample Performance (Jain & Zongker, IEEE Trans. PAMI, Feb 1997) Feature selection enables combining features from different data models Potential difficulties in feature selection (i) small sample size, (ii) what criterion function to use Let Y be the original set of features and X is the selected subset Feature selection criterion for the set X is J(X); large value of J indicates a better feature subset; problem is to find subset X such that 3

Taxonomy of Feature Selection Algorithms

Deterministic Single-Solution Methods Begin with a single solution (feature subset) & iteratively add or remove features until some termination criterion is met Also known as sequential methods Bottom up (forward method): begin with an empty set & add features Top-down (backward method): begin with a full set & delete features These greedy methods do not examine all possible subsets, so no guarantee of finding the optimal subset Pudil introduced two floating selection methods: SFFS, SFBS 15 feature selection methods listed in Table 1 were evaluated 5

Naiive Method Sort the given d features in order of their prob. of correct recognition Select the top m features from this sorted list Disadvantage: Feature correlation is not considered; best pair of features may not even contain the best individual feature

Sequential Forward Selection (SFS) Start with the empty set, X=0 Repeatedly add the most significant feature with respect to X Disadvantage: Once a feature is retained, it cannot be discarded; nesting problem

Sequential Backward Selection (SBS) Start with the full set, X=Y Repeatedly delete the least significant feature in X Disadvantage: SBS requires more computation than SFS; Nesting problem

Generalized Sequential Forward Selection (GSFS(m)) Start with the empty set, X=0 Repeatedly add the most significant m- subset of (Y - X) (found through exhaustive search)

Generalized Sequential Backward Selection (GSBS(m)) Start with the full set, X=Y Repeatedly delete the least significant m- subset of X (found through exhaustive search)

Sequential Forward Floating Search (SFFS) Step 1: Inclusion. Use the basic SFS method to select the most significant feature with respect to X and include it in X. Stop if d features have been selected, otherwise go to step 2. Step 2: Conditional exclusion. Find the least significant feature k in X. If it is the feature just added, then keep it and return to step 1. Otherwise, exclude the feature k. Note that X is now better than it was before step 1. Continue to step 3. Step 3: Continuation of conditional exclusion. Again find the least significant feature in X. If its removal will (a) leave X with at least 2 features, and (b) the value of J(X) is greater than the criterion value of the best feature subset of that size found so far, then remove it and repeat step 3. When these two conditions cease to be satisfied, return to step 1.

Experimental Results 20-dimensional 2-class Gaussian data with a common covariance matrix Goodness of features (criterion function) is measured by the Mahalanobis distance; recall that in the common covariance matrix case, the prob. of error is inversely proportional to the Mahalanobis distance Forward search methods are faster than backward methods Performance of floating method (SFFS) is comparable to branch & bound methods, but SFFS is significantly faster Nesting problem: The optimal three feature subset is not contained in the optimal four feature subset! Feature selection is an off-line process, so large cost during training can be afforded 12

Experimental Results 13

Selection of Texture Features Selection of texture features for classifying Synthetic Aperture Radar (SAR) images A total of 18 different features/pixel from 4 different models Can classification error be reduced by feature selection 22,000 samples (pixels) from 5 classes; equally split for training/test Can the classification error be reduced by feature selection? 14

Performance of SFFS on Texture Features Best individual texture model for this data is the MAR model Pooling features from different models and then applying feature selection results in an accuracy of 89.3% by the 1NN method Selected subset has representative feature from every model; 5- feature subset selected contains features from 3 different models 15

Effect of Training Set Size on Feature Selection How reliable are the feature selection results in small sample size situations? Would the error is estimating covariance (for Mahalanobis distance) affect the feature selection performance? Feature selection on the Trunk data with varying sample size 20-dim data from distributions below; n in [10, 5000] Feature selection quality: no. of common features in the subset selected by SFFS and by the optimal method For n = 20, B&B selected the subset {1,2,4,7,9,12,13,14,15,18}; optimal subset is {1,2,3,4,5,6,7,8,9,10} 16