An Efficient Greedy Method for Unsupervised Feature Selection




This article has been accepted for publication at the 2011 IEEE 11th International Conference on Data Mining (ICDM).

Ahmed K. Farahat, Ali Ghodsi, Mohamed S. Kamel
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Email: {afarahat, aghodsib, mkamel}@uwaterloo.ca

Abstract: In data mining applications, data instances are typically described by a huge number of features. Most of these features are irrelevant or redundant, which negatively affects the efficiency and effectiveness of different learning algorithms. The selection of relevant features is a crucial task which can be used to allow a better understanding of the data or to improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection in the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection which measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison to state-of-the-art methods for unsupervised feature selection.

Keywords: Feature Selection; Greedy Algorithms; Unsupervised Learning

I. INTRODUCTION

Data instances are typically described by a huge number of features. Most of these features are either redundant or irrelevant to the data mining task at hand. Having a large number of redundant and irrelevant features negatively affects the performance of the underlying learning algorithms and makes them more computationally demanding. Therefore, reducing the dimensionality of the data is a fundamental task for machine learning and data mining applications.

Over the past years, two approaches have been proposed for dimension reduction: feature selection and feature extraction. Feature selection (also known as variable selection or subset selection) searches for a relevant subset of existing features, while feature extraction (also known as feature transformation) learns a new set of features which combines existing features. These methods have been employed with both supervised and unsupervised learning, where in the case of supervised learning class labels are used to guide the selection or extraction of features.

Feature extraction methods produce a set of continuous vectors which represent data instances in the space of the extracted features. Accordingly, most of these methods obtain unique solutions in polynomial time, which makes them more attractive in terms of computational complexity. On the other hand, feature selection is a combinatorial optimization problem which is NP-hard, and most feature selection methods depend on heuristics to obtain a subset of relevant features in a manageable time. Nevertheless, feature extraction methods usually produce features which are difficult to interpret, and accordingly feature selection is more appealing in applications where understanding the meaning of the features is crucial for data analysis.

Feature selection methods can be categorized into wrapper and filter methods. Wrapper methods wrap feature selection around the learning process and search for features which enhance the performance of the learning task. Filter methods, on the other hand, analyze the intrinsic properties of the data and select highly-ranked features according to some criterion before the learning task is performed.
Wrapper methods are computationally more complex than filter methods, as they depend on deploying the learning models many times until a subset of relevant features is found.

This paper presents an effective filter method for unsupervised feature selection. The method is based on a novel criterion for feature selection which measures the reconstruction error of the data matrix based on the subset of selected features. The paper presents a novel recursive formula for calculating the criterion function as well as an efficient greedy algorithm to select features. The greedy algorithm selects at each iteration the most representative feature among the remaining features, and then eliminates the effect of the selected feature from the data matrix. This step makes it less likely for the algorithm to select features that are similar to previously selected features, which accordingly reduces the redundancy between the selected features. In addition, the use of the recursive criterion makes the algorithm computationally feasible and memory-efficient compared to state-of-the-art methods for unsupervised feature selection.

The rest of this paper is organized as follows. Section II defines the notations used throughout the paper. Section III discusses previous work on filter methods for unsupervised feature selection. Section IV presents the proposed feature selection criterion. Section V presents a novel recursive formula for the feature selection criterion. Section VI proposes an effective greedy algorithm for feature selection as well as memory- and time-efficient variants of the algorithm. Section VII presents an empirical evaluation of the proposed method. Finally, Section VIII concludes the paper.

II. NOTATIONS

Throughout the paper, scalars, vectors, sets, and matrices are shown in small, small bold italic, script, and capital letters, respectively. In addition, the following notations are used.

For a vector x ∈ R^p:
- x_i: the i-th element of x.
- ||x||: the Euclidean norm (ℓ2-norm) of x.

For a matrix A ∈ R^{p×q}:
- A_{ij}: the (i, j)-th entry of A.
- A_{i:}: the i-th row of A.
- A_{:j}: the j-th column of A.
- A_{S:}: the sub-matrix of A which consists of the set S of rows.
- A_{:S}: the sub-matrix of A which consists of the set S of columns.
- Ã: a low-rank approximation of A.
- Ã_S: a rank-k approximation of A based on the set S of columns, where |S| = k.
- ||A||_F: the Frobenius norm of A: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.
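To make the indexing notation concrete, the following minimal NumPy sketch (ours, not part of the paper) shows how A_{:S}, A_{S:}, and the Frobenius norm map onto array operations:

```python
import numpy as np

A = np.random.randn(5, 8)        # a small data matrix: m = 5 instances, n = 8 features
S = [1, 4, 6]                    # indices of a selected feature subset

A_cols_S = A[:, S]               # A_{:S}: sub-matrix of the columns indexed by S
A_rows_S = A[S, :]               # A_{S:}: sub-matrix of the rows indexed by S
fro = np.linalg.norm(A, 'fro')   # ||A||_F
assert np.isclose(fro ** 2, np.trace(A.T @ A))  # ||A||_F^2 = trace(A'A)
```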
III. PREVIOUS WORK

Many filter methods for unsupervised feature selection depend on the Principal Component Analysis (PCA) method [1] to search for the most representative features. PCA is the best-known method for unsupervised feature extraction, which finds directions with maximum variance in the feature space (namely, principal components). The principal components are also the directions that achieve the minimum reconstruction error for the data matrix. Jolliffe [1] suggests different algorithms to use PCA for unsupervised feature selection. In these algorithms, features are first associated with principal components based on the absolute values of their coefficients, and then features corresponding to the first (or last) principal components are selected (or deleted). This can be done once or recursively (i.e., by first selecting or deleting some features and then recomputing the principal components based on the remaining features). Similarly, sparse PCA [2], a variant of PCA which produces sparse principal components, can also be used for feature selection. This can be done by selecting for each principal component the subset of features with non-zero coefficients. However, Masaeli et al. [3] showed that these sparse coefficients may be distributed across different features and accordingly are not always useful for feature selection. Another iterative approach is suggested by Cui and Dy [4], in which the feature that is most correlated with the first principal component is selected, and then the other features are projected onto the direction orthogonal to that feature. These steps are repeated until the required number of features is selected. Lu et al. [5] suggest a different PCA-based approach which applies k-means clustering to the principal components, and then selects the features that are close to the cluster centroids.

Boutsidis et al. [6], [7] propose a feature selection method that randomly samples features based on probabilities calculated using the k leading singular values of the data matrix. In [6], random sampling is used to reduce the number of candidate features, and then the required number of features is selected by applying a complex subset selection algorithm on the reduced matrix. In [7], the authors derive a theoretical guarantee for the error of k-means clustering when features are selected using random sampling. However, theoretical guarantees for other clustering algorithms were not explored in that work. Recently, Masaeli et al. [3] proposed an algorithm called Convex Principal Feature Selection (CPFS). CPFS formulates feature selection as a convex continuous optimization problem which minimizes the mean-squared reconstruction error of the data matrix (a PCA-like criterion) with sparsity constraints. This is a quadratic programming problem with linear constraints, which was solved using a projected quasi-Newton method.

Another category of unsupervised feature selection methods is based on selecting features that preserve similarities between data instances. Most of these methods first construct a k-nearest-neighbor graph between data instances, and then select features that preserve the structure of that graph. Examples of these methods include the Laplacian score (LS) [8] and the spectral feature selection method (a.k.a. SPEC) [9]. The Laplacian score (LS) [8] calculates a score for each feature based on the graph Laplacian and degree matrices. This score quantifies how well each feature preserves similarity between data instances and their neighbors in the graph. Spectral feature selection (SPEC) [9] extends this idea and presents a general framework for ranking features on a k-nearest-neighbor graph.

Some methods directly select features which preserve the cluster structure of the data. The Q-α algorithm [10] measures the goodness of a subset of features based on the clustering quality (namely, cluster coherence) when the data is represented using only those features. The authors define a feature weight vector, and propose an iterative algorithm that alternates between calculating the cluster coherence based on the current weight vector and estimating a new weight vector that maximizes that coherence. This algorithm converges to a local minimum of the cluster coherence and produces a sparse weight vector that indicates which features should be selected. Recently, Cai et al. [11] proposed an algorithm called Multi-Cluster Feature Selection (MCFS) which selects a subset of features such that the multi-cluster structure of the data is preserved. To achieve that, the authors employ a method similar to spectral clustering [12], which first constructs a k-nearest-neighbor graph over the data instances, and then solves a generalized eigenproblem over the graph Laplacian and degree matrices.

After that, for each eigenvector, an L1-regularized regression problem is solved to represent the eigenvector using a sparse combination of features. Features are then assigned scores based on these coefficients, and highly scored features are selected. The authors show experimentally that the MCFS algorithm outperforms the Laplacian score (LS) and the Q-α algorithm.

Another well-known approach for unsupervised feature selection is the Feature Selection using Feature Similarity (FSFS) method suggested by Mitra et al. [13]. The method groups features into clusters and then selects a representative feature for each cluster. To group features, the algorithm starts by calculating pairwise similarities between features, and then it constructs a k-nearest-neighbor graph over the features. The algorithm then selects the feature with the most compact neighborhood and removes all its neighbors. This process is repeated on the remaining features until all features are either selected or removed. The authors also suggested a new feature similarity measure, namely maximal information compression, which quantifies the minimum amount of information loss when one feature is represented by the other.

In comparison to previous work, the greedy feature selection method proposed in this paper uses a PCA-like criterion which minimizes the reconstruction error of the data matrix based on the selected subset of features. In contrast to traditional PCA-based methods, the proposed algorithm does not calculate the principal components, which is computationally demanding. Unlike the Laplacian score (LS) [8] and its extension SPEC [9], the greedy feature selection method does not depend on calculating pairwise similarities between instances. It also does not calculate an eigenvalue decomposition over the similarity matrix as the Q-α algorithm [10] and Multi-Cluster Feature Selection (MCFS) [11] do. The feature selection criterion presented in this paper is similar to that of Convex Principal Feature Selection (CPFS) [3], as both minimize the reconstruction error of the data matrix. While the method presented here uses a greedy algorithm to minimize a discrete optimization problem, CPFS solves a quadratic programming problem with sparsity constraints. In addition, the number of features selected by CPFS depends on a regularization parameter λ which is difficult to tune. Similar to the method proposed by Cui and Dy [4], the method presented in this paper removes the effect of each selected feature by projecting the other features onto the direction orthogonal to that selected feature. However, the method proposed by Cui and Dy is computationally very complex, as it requires the calculation of the first principal component of the whole matrix after each iteration. The Feature Selection using Feature Similarity (FSFS) method [13] employs a similar greedy approach which selects the most representative feature and then eliminates its neighbors in the feature similarity graph. That method, however, depends on a computationally complex measure for calculating similarity between features. As shown in Section VII, experiments on real data sets show that the proposed algorithm outperforms the Feature Selection using Feature Similarity (FSFS) method [13], the Laplacian score (LS) [8], and Multi-Cluster Feature Selection (MCFS) [11] when applied with different clustering algorithms.

IV. FEATURE SELECTION CRITERION

This section defines a novel criterion for unsupervised feature selection. The criterion measures the reconstruction error of the data matrix based on the selected subset of features. The goal of the proposed feature selection algorithm is to select a subset of features that minimizes this reconstruction error.
Definition 1 (Unsupervised Feature Selection Criterion): Let A be an m × n data matrix whose rows represent the set of data instances and whose columns represent the set of features. The feature selection criterion is defined as

$$F(\mathcal{S}) = \|A - P^{(\mathcal{S})} A\|_F^2$$

where S is the set of the indices of the selected features, and P^(S) is an m × m projection matrix which projects the columns of A onto the span of the set S of columns.

The criterion F(S) represents the sum of squared errors between the original data matrix A and its rank-k approximation based on the selected set of features (where k = |S|):

$$\tilde{A}_{\mathcal{S}} = P^{(\mathcal{S})} A. \tag{1}$$

The projection matrix P^(S) can be calculated as

$$P^{(\mathcal{S})} = A_{:\mathcal{S}} \left(A_{:\mathcal{S}}^{\top} A_{:\mathcal{S}}\right)^{-1} A_{:\mathcal{S}}^{\top} \tag{2}$$

where A_{:S} is the sub-matrix of A which consists of the columns corresponding to S. It should be noted that if the subset of features S is known, the projection matrix P^(S) is obtained from the closed-form solution of the least-squares problem which minimizes F(S).

The goal of the feature selection algorithm presented in this paper is to select a subset S of features such that F(S) is minimized.

Problem 1 (Unsupervised Feature Selection): Find a subset of features L such that

$$\mathcal{L} = \arg\min_{\mathcal{S}} F(\mathcal{S}).$$

This is an NP-hard combinatorial optimization problem. In Section V, a recursive formula for the selection criterion is presented. This formula allows the development of an efficient algorithm to greedily minimize F(S). The greedy algorithm is presented in Section VI.
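As an illustration, the criterion of Definition 1 can be evaluated directly in a few lines of NumPy. This is a sketch for intuition only: it solves the least-squares problem behind Eq. (2) instead of forming the inverse explicitly, and the function name `selection_criterion` is ours, not the authors':

```python
import numpy as np

def selection_criterion(A, S):
    """F(S) = ||A - P^(S) A||_F^2: the squared reconstruction error of A
    when its columns are projected onto the span of the columns in S."""
    A_S = A[:, list(S)]
    # Solving min_W ||A_S W - A||_F gives P^(S) A = A_S W, which is
    # numerically safer than forming (A_S' A_S)^{-1} explicitly as in Eq. (2).
    W, *_ = np.linalg.lstsq(A_S, A, rcond=None)
    return np.linalg.norm(A - A_S @ W, 'fro') ** 2
```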

V. RECURSIVE SELECTION CRITERION

In this section, a recursive formula is derived for the feature selection criterion presented in Section IV. This formula is based on a recursive formula for the projection matrix P^(S), which can be derived as follows.

Lemma 1: Given a set of features S, for any P ⊂ S,

$$P^{(\mathcal{S})} = P^{(\mathcal{P})} + R^{(\mathcal{R})}$$

where R^(R) is a projection matrix which projects the columns of E = A − P^(P) A onto the span of the subset R = S \ P of columns:

$$R^{(\mathcal{R})} = E_{:\mathcal{R}} \left(E_{:\mathcal{R}}^{\top} E_{:\mathcal{R}}\right)^{-1} E_{:\mathcal{R}}^{\top}.$$

Proof: Define a matrix B = A_{:S}^⊤ A_{:S}, which represents the inner-products over the columns of the sub-matrix A_{:S}. The projection matrix P^(S) can be written as

$$P^{(\mathcal{S})} = A_{:\mathcal{S}} B^{-1} A_{:\mathcal{S}}^{\top}. \tag{3}$$

Without loss of generality, the columns and rows of A_{:S} and B in Eq. (3) can be rearranged such that the first sets of rows and columns correspond to P:

$$A_{:\mathcal{S}} = \begin{bmatrix} A_{:\mathcal{P}} & A_{:\mathcal{R}} \end{bmatrix}, \qquad B = \begin{bmatrix} B_{\mathcal{PP}} & B_{\mathcal{PR}} \\ B_{\mathcal{PR}}^{\top} & B_{\mathcal{RR}} \end{bmatrix}$$

where B_PP = A_{:P}^⊤ A_{:P}, B_PR = A_{:P}^⊤ A_{:R}, and B_RR = A_{:R}^⊤ A_{:R}. Let $S = B_{\mathcal{RR}} - B_{\mathcal{PR}}^{\top} B_{\mathcal{PP}}^{-1} B_{\mathcal{PR}}$ be the Schur complement [14] of B_PP in B. Using the block-wise inversion formula [14] for B^{-1} and substituting A_{:S} and B^{-1} into Eq. (3) gives

$$P^{(\mathcal{S})} = \begin{bmatrix} A_{:\mathcal{P}} & A_{:\mathcal{R}} \end{bmatrix} \begin{bmatrix} B_{\mathcal{PP}}^{-1} + B_{\mathcal{PP}}^{-1} B_{\mathcal{PR}} S^{-1} B_{\mathcal{PR}}^{\top} B_{\mathcal{PP}}^{-1} & -B_{\mathcal{PP}}^{-1} B_{\mathcal{PR}} S^{-1} \\ -S^{-1} B_{\mathcal{PR}}^{\top} B_{\mathcal{PP}}^{-1} & S^{-1} \end{bmatrix} \begin{bmatrix} A_{:\mathcal{P}}^{\top} \\ A_{:\mathcal{R}}^{\top} \end{bmatrix}.$$

The right-hand side can be simplified to

$$P^{(\mathcal{S})} = A_{:\mathcal{P}} B_{\mathcal{PP}}^{-1} A_{:\mathcal{P}}^{\top} + \left(A_{:\mathcal{R}} - A_{:\mathcal{P}} B_{\mathcal{PP}}^{-1} B_{\mathcal{PR}}\right) S^{-1} \left(A_{:\mathcal{R}} - A_{:\mathcal{P}} B_{\mathcal{PP}}^{-1} B_{\mathcal{PR}}\right)^{\top}. \tag{4}$$

The first term of Eq. (4) is the projection matrix which projects the columns of A onto the span of the subset P of columns: P^(P) = A_{:P} B_PP^{-1} A_{:P}^⊤. The second term can be simplified as follows. Let E be an m × n residual matrix calculated as E = A − P^(P) A. It can be shown that E_{:R} = A_{:R} − A_{:P} B_PP^{-1} B_PR, and S = E_{:R}^⊤ E_{:R}. Hence, the second term of Eq. (4) is the projection matrix which projects the columns of E onto the span of the subset R of columns:

$$R^{(\mathcal{R})} = E_{:\mathcal{R}} \left(E_{:\mathcal{R}}^{\top} E_{:\mathcal{R}}\right)^{-1} E_{:\mathcal{R}}^{\top}. \tag{5}$$

This proves that P^(S) can be written in terms of P^(P) and R^(R) as P^(S) = P^(P) + R^(R). ∎

This means that the projection matrix P^(S) can be constructed in a recursive manner by first calculating the projection matrix which projects the columns of A onto the span of the subset P of columns, and then calculating the projection matrix which projects the columns of the residual matrix onto the span of the remaining columns.

Based on this lemma, a recursive formula can be developed for Ã_S.

Corollary 1: Given a matrix A and a subset of columns S, for any P ⊂ S,

$$\tilde{A}_{\mathcal{S}} = \tilde{A}_{\mathcal{P}} + \tilde{E}_{\mathcal{R}}$$

where E = A − P^(P) A, and Ẽ_R is the low-rank approximation of E based on the subset R = S \ P of columns.

Proof: Using Lemma 1 and substituting P^(S) into Eq. (1) gives

$$\tilde{A}_{\mathcal{S}} = P^{(\mathcal{P})} A + E_{:\mathcal{R}} \left(E_{:\mathcal{R}}^{\top} E_{:\mathcal{R}}\right)^{-1} E_{:\mathcal{R}}^{\top} A. \tag{6}$$

The first term is the low-rank approximation of A based on P: Ã_P = P^(P) A. The second term is equal to Ẽ_R if E_{:R}^⊤ A = E_{:R}^⊤ E. To prove that, multiplying E_{:R}^⊤ by E = A − P^(P) A gives E_{:R}^⊤ E = E_{:R}^⊤ A − E_{:R}^⊤ P^(P) A. Using E_{:R} = A_{:R} − P^(P) A_{:R}, the expression E_{:R}^⊤ P^(P) can be written as

$$E_{:\mathcal{R}}^{\top} P^{(\mathcal{P})} = A_{:\mathcal{R}}^{\top} P^{(\mathcal{P})} - A_{:\mathcal{R}}^{\top} P^{(\mathcal{P})} P^{(\mathcal{P})}.$$

This is equal to 0, as P^(P) P^(P) = P^(P) (a property of projection matrices). This means that E_{:R}^⊤ A = E_{:R}^⊤ E. Substituting E_{:R}^⊤ A with E_{:R}^⊤ E in Eq. (6) proves the corollary. ∎

Based on Corollary 1, a recursive formula for the feature selection criterion can be developed as follows.

Theorem 2: Given a set of features S, for any P ⊂ S,

$$F(\mathcal{S}) = F(\mathcal{P}) - \|\tilde{E}_{\mathcal{R}}\|_F^2$$

where E = A − P^(P) A, and Ẽ_R is the low-rank approximation of E based on the subset R = S \ P of columns.

Proof: Substituting P^(S) into Eq. (1) gives

$$F(\mathcal{S}) = \|A - \tilde{A}_{\mathcal{S}}\|_F^2 = \|A - \tilde{A}_{\mathcal{P}} - \tilde{E}_{\mathcal{R}}\|_F^2 = \|E - \tilde{E}_{\mathcal{R}}\|_F^2.$$

Using the relation between the Frobenius norm and the trace function¹, the right-hand side can be expressed as

$$\|E - \tilde{E}_{\mathcal{R}}\|_F^2 = \operatorname{trace}\!\left(\left(E - \tilde{E}_{\mathcal{R}}\right)^{\top}\left(E - \tilde{E}_{\mathcal{R}}\right)\right) = \operatorname{trace}\!\left(E^{\top} E - 2 E^{\top} \tilde{E}_{\mathcal{R}} + \tilde{E}_{\mathcal{R}}^{\top} \tilde{E}_{\mathcal{R}}\right).$$

As R^(R) R^(R) = R^(R), the expression Ẽ_R^⊤ Ẽ_R can be written as

$$\tilde{E}_{\mathcal{R}}^{\top} \tilde{E}_{\mathcal{R}} = E^{\top} R^{(\mathcal{R})} R^{(\mathcal{R})} E = E^{\top} R^{(\mathcal{R})} E = E^{\top} \tilde{E}_{\mathcal{R}}.$$

This means that

$$F(\mathcal{S}) = \|E - \tilde{E}_{\mathcal{R}}\|_F^2 = \operatorname{trace}\!\left(E^{\top} E - \tilde{E}_{\mathcal{R}}^{\top} \tilde{E}_{\mathcal{R}}\right) = \|E\|_F^2 - \|\tilde{E}_{\mathcal{R}}\|_F^2.$$

Replacing ‖E‖_F² with F(P) proves the theorem. ∎
The term ‖Ẽ_R‖_F² represents the decrease in reconstruction error achieved by adding the subset R of features to P. In the following section, a novel greedy heuristic is presented to optimize the feature selection criterion based on this recursive formula.

¹ $\|A\|_F^2 = \operatorname{trace}(A^{\top} A)$
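The recursion of Theorem 2 is easy to check numerically. The sketch below (assuming the `selection_criterion` helper above, with arbitrary index sets P and R chosen for illustration) verifies that F(P ∪ R) = F(P) − ‖Ẽ_R‖_F² on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
P, R = [2, 7], [11]                                  # S = P ∪ R

# Residual after projecting onto the columns in P: E = A - P^(P) A.
A_P = A[:, P]
E = A - A_P @ np.linalg.lstsq(A_P, A, rcond=None)[0]

# Low-rank approximation of E based on the columns in R.
E_R = E[:, R]
E_tilde = E_R @ np.linalg.lstsq(E_R, E, rcond=None)[0]

# Theorem 2: F(S) = F(P) - ||E~_R||_F^2.
lhs = selection_criterion(A, P + R)
rhs = selection_criterion(A, P) - np.linalg.norm(E_tilde, 'fro') ** 2
assert np.isclose(lhs, rhs)
```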

VI. GREEDY SELECTION ALGORITHM

This section presents an efficient greedy algorithm to optimize the feature selection criterion presented in Section IV. The algorithm selects at each iteration one feature such that the reconstruction error for the new set of features is minimized. This problem can be formulated as follows.

Problem 2 (Greedy Feature Selection): At iteration t, find feature l such that

$$l = \arg\min_{i} F(\mathcal{S} \cup \{i\}) \tag{7}$$

where S is the set of features selected during the first t − 1 iterations.

A naïve implementation of the greedy algorithm is to calculate the reconstruction error for each candidate feature and then select the feature with the smallest error. This implementation is, however, computationally very complex, as it requires O(m²n²) floating-point operations per iteration. A more efficient approach is to use the recursive formula for calculating the reconstruction error. Using Theorem 2, F(S ∪ {i}) = F(S) − ‖Ẽ_{i}‖_F², where E = A − Ã_S. Since F(S) is a constant for all candidate features, an equivalent criterion is

$$l = \arg\max_{i} \|\tilde{E}_{\{i\}}\|_F^2. \tag{8}$$

This formulation selects the feature which achieves the maximum decrease in reconstruction error. The new objective function ‖Ẽ_{i}‖_F² can be simplified as follows:

$$\|\tilde{E}_{\{i\}}\|_F^2 = \operatorname{trace}\!\left(\tilde{E}_{\{i\}}^{\top} \tilde{E}_{\{i\}}\right) = \operatorname{trace}\!\left(E^{\top} R^{(\{i\})} E\right) = \operatorname{trace}\!\left(E^{\top} E_{:i} \left(E_{:i}^{\top} E_{:i}\right)^{-1} E_{:i}^{\top} E\right) = \frac{1}{E_{:i}^{\top} E_{:i}} \operatorname{trace}\!\left(E^{\top} E_{:i} E_{:i}^{\top} E\right) = \frac{\|E^{\top} E_{:i}\|^2}{E_{:i}^{\top} E_{:i}}. \tag{9}$$

This defines the following simplified problem.

Problem 3 (Simplified Greedy Feature Selection): At iteration t, find feature l such that

$$l = \arg\max_{i} \frac{\|E^{\top} E_{:i}\|^2}{E_{:i}^{\top} E_{:i}}$$

where E = A − Ã_S, and S is the set of features selected during the first t − 1 iterations.

The computational complexity of this selection criterion is O(n²m) per iteration, and it requires O(nm) memory to store the residual of the whole matrix, E, after each iteration. In the rest of this section, two novel techniques are proposed to reduce the memory and time requirements of this selection criterion.
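Problem 3 translates directly into an explicit-residual implementation: score every candidate column of E, pick the best, then deflate E by the selected column (the deflation step is exactly the residual update formalized as Lemma 2 in the next subsection). The sketch below (our own naming) keeps E in memory and is therefore the O(n²m)-per-iteration baseline, not the efficient variant:

```python
import numpy as np

def greedy_select_explicit(A, k):
    """Greedy selection by Problem 3 with the residual E kept explicitly."""
    E = A.astype(float).copy()
    selected = []
    for _ in range(k):
        G = E.T @ E                                # inner-products of residual columns
        scores = np.sum(G ** 2, axis=0) / np.maximum(np.diag(G), 1e-12)
        scores[selected] = -np.inf                 # never re-select a feature
        l = int(np.argmax(scores))                 # argmax ||E' E_l||^2 / (E_l' E_l)
        selected.append(l)
        e = E[:, l]
        E = E - np.outer(e, e @ E) / (e @ e)       # deflate the selected direction
    return selected
```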
A. Memory-Efficient Criterion

This section proposes a memory-efficient algorithm to calculate the simplified feature selection criterion without explicitly calculating and storing the residual matrix E at each iteration. The algorithm is based on a recursive formula for calculating the residual matrix E. Let S^(t) denote the set of features selected during the first t − 1 iterations, E^(t) denote the residual matrix at the start of the t-th iteration (i.e., E^(t) = A − Ã_{S^(t)}), and l^(t) be the feature selected at iteration t. The following lemma gives a recursive formula for the residual matrix at the end of iteration t, E^(t+1).

Lemma 2: E^(t+1) can be calculated recursively as

$$E^{(t+1)} = \left(E - \frac{E_{:l} E_{:l}^{\top}}{E_{:l}^{\top} E_{:l}}\, E\right)^{(t)}.$$

Proof: Using Corollary 1, Ã_{S ∪ {l}} = Ã_S + Ẽ_{l}. Subtracting both sides from A, and substituting A − Ã_{S ∪ {l}} and A − Ã_S with E^(t+1) and E^(t) respectively, gives

$$E^{(t+1)} = \left(E - \tilde{E}_{\{l\}}\right)^{(t)}.$$

Using Eqs. (1) and (2), Ẽ_{l} can be expressed as E_{:l} (E_{:l}^⊤ E_{:l})^{-1} E_{:l}^⊤ E. Substituting Ẽ_{l} with this formula in the above equation proves the lemma. ∎

Let G be an n × n matrix which represents the inner-products over the columns of the residual matrix E: G = E^⊤ E. The following corollary is a direct result of Lemma 2.

Corollary 3: G^(t+1) can be calculated recursively as

$$G^{(t+1)} = \left(G - \frac{G_{:l} G_{:l}^{\top}}{G_{ll}}\right)^{(t)}.$$

Proof: This corollary can be proved by substituting E^(t+1) (Lemma 2) into G^(t+1) = E^{(t+1)\top} E^{(t+1)}, and using the fact that

$$E^{\top} E_{:l} \left(E_{:l}^{\top} E_{:l}\right)^{-1} E_{:l}^{\top} E_{:l} \left(E_{:l}^{\top} E_{:l}\right)^{-1} E_{:l}^{\top} E = E^{\top} E_{:l} \left(E_{:l}^{\top} E_{:l}\right)^{-1} E_{:l}^{\top} E. \;\blacksquare$$

To simplify the derivation of the memory-efficient algorithm, at iteration t define δ = G_{:l} and ω = G_{:l} / √G_{ll} = δ / √δ_l. This means that G^(t+1) can be calculated in terms of G^(t) and ω^(t) as

$$G^{(t+1)} = \left(G - \omega \omega^{\top}\right)^{(t)}, \tag{10}$$

or in terms of A and the previous ω's as

$$G^{(t+1)} = A^{\top} A - \sum_{r=1}^{t} \left(\omega \omega^{\top}\right)^{(r)}. \tag{11}$$

δ^(t) and ω^(t) can be calculated in terms of A and the previous ω's as follows:

$$\delta^{(t)} = A^{\top} A_{:l} - \sum_{r=1}^{t-1} \omega_l^{(r)} \omega^{(r)}, \qquad \omega^{(t)} = \delta^{(t)} \big/ \sqrt{\delta_l^{(t)}}.$$
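The equivalence between the explicit residual and the ω-based recursion can be sanity-checked numerically. This sketch (ours, with two arbitrary selection steps) confirms that Eq. (11) reproduces G = E^⊤E without ever storing E:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))

omegas, E = [], A.copy()
for l in [3, 8]:                                   # two illustrative selection steps
    # delta and omega from A and past omegas only (no residual stored):
    delta = A.T @ A[:, l] - sum(w[l] * w for w in omegas)
    omega = delta / np.sqrt(delta[l])
    omegas.append(omega)
    # reference: explicit residual update (Lemma 2)
    e = E[:, l]
    E = E - np.outer(e, e @ E) / (e @ e)

G_rec = A.T @ A - sum(np.outer(w, w) for w in omegas)   # Eq. (11)
assert np.allclose(G_rec, E.T @ E)
```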

The simplified feature selection criterion can be expressed in terms of G as

$$l = \arg\max_{i} \frac{\|G_{:i}\|^2}{G_{ii}}.$$

The following theorem gives recursive formulas for calculating the simplified feature selection criterion without explicitly calculating E or G.

Theorem 4: Let f_i = ‖G_{:i}‖² and g_i = G_{ii} be the numerator and denominator of the simplified criterion function for a feature i respectively, f = [f_i]_{i=1..n}, and g = [g_i]_{i=1..n}. Then

$$f^{(t)} = \left(f - 2\,\omega \circ \left(A^{\top} A \omega - \sum_{r=1}^{t-2} \omega^{(r)} \left(\omega^{(r)\top} \omega\right)\right) + \|\omega\|^2 \left(\omega \circ \omega\right)\right)^{(t-1)},$$
$$g^{(t)} = \left(g - \omega \circ \omega\right)^{(t-1)},$$

where ∘ represents the Hadamard (element-wise) product operator.

Proof: Based on Eq. (10), f_i^(t) can be calculated as

$$f_i^{(t)} = \left(\|G_{:i}\|^2\right)^{(t)} = \left(\|G_{:i} - \omega\, \omega_i\|^2\right)^{(t-1)} = \left(G_{:i}^{\top} G_{:i} - 2\,\omega_i\, G_{:i}^{\top} \omega + \omega_i^2 \|\omega\|^2\right)^{(t-1)} = \left(f_i - 2\,\omega_i\, G_{:i}^{\top} \omega + \omega_i^2 \|\omega\|^2\right)^{(t-1)}. \tag{12}$$

Similarly, g_i^(t) can be calculated as

$$g_i^{(t)} = G_{ii}^{(t)} = \left(G_{ii} - \omega_i^2\right)^{(t-1)} = \left(g_i - \omega_i^2\right)^{(t-1)}. \tag{13}$$

Let f = [f_i]_{i=1..n} and g = [g_i]_{i=1..n}; then f^(t) and g^(t) can be expressed as

$$f^{(t)} = \left(f - 2\,\omega \circ (G\omega) + \|\omega\|^2 \left(\omega \circ \omega\right)\right)^{(t-1)}, \qquad g^{(t)} = \left(g - \omega \circ \omega\right)^{(t-1)}, \tag{14}$$

where ∘ represents the Hadamard product operator and ‖·‖ is the ℓ2-norm. Based on the recursive formula for G (Eq. (11)), the term Gω at iteration (t − 1) can be expressed as

$$G\omega = \left(A^{\top} A - \sum_{r=1}^{t-2} \left(\omega \omega^{\top}\right)^{(r)}\right) \omega = A^{\top} A \omega - \sum_{r=1}^{t-2} \omega^{(r)} \left(\omega^{(r)\top} \omega\right). \tag{15}$$

Substituting Gω into Eq. (14) gives the update formulas for f and g. ∎

This means that the greedy criterion can be made memory-efficient by only maintaining two score variables for each feature, f_i and g_i, and updating them at each iteration based on their previous values and the features selected so far.
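Putting Theorem 4 to work gives the basic greedy method (the variant without random partitioning evaluated in Section VII). A minimal sketch under our own naming, without the sparse-matrix optimizations:

```python
import numpy as np

def greedy_select_memory_efficient(A, k):
    """Basic greedy selection maintaining only the f/g scores of Theorem 4;
    the residual matrix E is never formed."""
    AtA = A.T @ A                      # initializes f and provides A'A omega
    f = np.sum(AtA ** 2, axis=0)       # f_i = ||G_{:i}||^2 with G = A'A initially
    g = np.diag(AtA).copy()            # g_i = G_ii
    omegas, selected = [], []
    for _ in range(k):
        scores = f / np.maximum(g, 1e-12)
        scores[selected] = -np.inf
        l = int(np.argmax(scores))
        selected.append(l)
        delta = AtA[:, l] - sum(w[l] * w for w in omegas)
        omega = delta / np.sqrt(delta[l])
        # Theorem 4 updates (Hadamard products written element-wise):
        Gw = AtA @ omega - sum(w * (w @ omega) for w in omegas)
        f = f - 2.0 * omega * Gw + (omega @ omega) * omega ** 2
        g = g - omega ** 2
        omegas.append(omega)
    return selected
```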
B. Partition-Based Criterion

The simplified feature selection criterion calculates, at each iteration, the inner-products between each candidate feature E_{:i} and all other features. The computational complexity of these inner-products is O(nm) per candidate feature (or O(n²m) per iteration). When the memory-efficient update formulas are used, the computational complexity is reduced to O(nm) per iteration (that of calculating A^⊤Aω). However, the complexity of calculating the initial value of f is still O(n²m). In order to reduce this computational complexity, a novel partition-based criterion is proposed, which reduces the number of inner-products to be calculated at each iteration. The criterion partitions the features into c ≪ n random groups, and selects the feature which best represents the centroids of these groups. Let P_j be the set of features that belong to the j-th partition, P = {P_1, P_2, ..., P_c} be a random partitioning of the features into c groups, and B be an m × c matrix whose j-th column is the sum of the feature vectors that belong to the j-th group: B_{:j} = Σ_{r ∈ P_j} A_{:r}. The use of the sum function (instead of the mean) weights each column of B with the size of the corresponding group. This avoids any bias towards larger groups when calculating the sum of inner-products. The simplified selection criterion can be written as follows.

Problem 4 (Simplified Partition-Based Greedy Feature Selection): At iteration t, find feature l such that

$$l = \arg\max_{i} \frac{\|F^{\top} E_{:i}\|^2}{E_{:i}^{\top} E_{:i}} \tag{16}$$

where E = A − Ã_S, S is the set of features selected during the first t − 1 iterations, F_{:j} = Σ_{r ∈ P_j} E_{:r}, and P = {P_1, P_2, ..., P_c} is a random partitioning of the features into c groups.

Similar to E (Lemma 2), F can be calculated in a recursive manner as

$$F^{(t+1)} = \left(F - \frac{E_{:l} E_{:l}^{\top}}{E_{:l}^{\top} E_{:l}}\, F\right)^{(t)}.$$

This means that the random partitioning can be done once at the start of the algorithm. After that, F is initialized to B and then updated recursively using the above formula. The computational complexity of calculating B is O(nm) if the data matrix is full. However, this complexity could be considerably reduced if the data matrix is very sparse.
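The matrix B needs only a single pass over a random group assignment. A sketch, assuming groups are drawn uniformly at random (the paper requires only that the partitioning be random, not a particular scheme):

```python
import numpy as np

def random_partition_sums(A, c, rng):
    """B_{:j} = sum of the feature columns assigned to group j under a
    random partitioning of the n features into c groups."""
    n = A.shape[1]
    groups = rng.integers(0, c, size=n)     # feature -> group assignment
    B = np.zeros((A.shape[0], c))
    for j in range(c):
        B[:, j] = A[:, groups == j].sum(axis=1)
    return B
```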

Further, a memory-efficient variant of the partition-based algorithm can be developed as follows. Let H be a c × n matrix whose element H_{ji} is the inner-product of the centroid of the j-th group and the i-th feature, weighted by the size of the j-th group: H = F^⊤ E. Similarly, H can be calculated recursively as

$$H^{(t+1)} = \left(H - \frac{H_{:l} G_{:l}^{\top}}{G_{ll}}\right)^{(t)}.$$

Define γ = H_{:l} and υ = H_{:l} / √G_{ll} = γ / √δ_l. H^(t+1) can be calculated in terms of H^(t), υ^(t), and ω^(t) as

$$H^{(t+1)} = \left(H - \upsilon \omega^{\top}\right)^{(t)}, \tag{17}$$

or in terms of A and the previous ω's and υ's as

$$H^{(t+1)} = B^{\top} A - \sum_{r=1}^{t} \left(\upsilon \omega^{\top}\right)^{(r)}. \tag{18}$$

γ^(t) and υ^(t) can be calculated in terms of A, B, and the previous ω's and υ's as follows:

$$\gamma^{(t)} = B^{\top} A_{:l} - \sum_{r=1}^{t-1} \omega_l^{(r)} \upsilon^{(r)}, \qquad \upsilon^{(t)} = \gamma^{(t)} \big/ \sqrt{\delta_l^{(t)}}.$$

The simplified partition-based selection criterion can be expressed in terms of H and G as

$$l = \arg\max_{i} \frac{\|H_{:i}\|^2}{G_{ii}}.$$

Similar to Theorem 4, the following theorem derives recursive formulas for the simplified partition-based criterion function.

Theorem 5: Let f_i = ‖H_{:i}‖² and g_i = G_{ii} be the numerator and denominator of the partition-based simplified criterion function for a feature i respectively, f = [f_i]_{i=1..n}, and g = [g_i]_{i=1..n}. Then

$$f^{(t)} = \left(f - 2\,\omega \circ \left(A^{\top} B \upsilon - \sum_{r=1}^{t-2} \omega^{(r)} \left(\upsilon^{(r)\top} \upsilon\right)\right) + \|\upsilon\|^2 \left(\omega \circ \omega\right)\right)^{(t-1)},$$
$$g^{(t)} = \left(g - \omega \circ \omega\right)^{(t-1)},$$

where ∘ represents the Hadamard product operator.

Proof: The proof is similar to that of Theorem 4. It can be easily derived by using the recursive formula for H_{:i} instead of that for G_{:i}. ∎

In these update formulas, A^⊤B can be calculated once and then used in different iterations. This makes the computational complexity of the new update formulas O(nc) per iteration. Algorithm 1 shows the complete greedy algorithm. The computational complexity of the algorithm is dominated by that of calculating A^⊤A_{:l} in Step (b), which is O(mn) per iteration. The other complex step is calculating the initial f, which is O(mnc). However, these steps can be implemented in an efficient way if the data matrix is sparse. The total complexity of the algorithm is O(max(mnk, mnc)), where k is the number of selected features and c is the number of random partitions.

Algorithm 1: Greedy Feature Selection
Inputs: Data matrix A, number of features k
Outputs: Selected features S
Steps:
1) Initialize S = {}. Generate a random partitioning P, and calculate B: B_{:j} = Σ_{r ∈ P_j} A_{:r}.
2) Initialize f_i^(0) = ‖B^⊤ A_{:i}‖² and g_i^(0) = A_{:i}^⊤ A_{:i}.
3) Repeat for t = 1 .. k:
   a) l = arg max_i f_i^(t) / g_i^(t); S = S ∪ {l}
   b) δ^(t) = A^⊤ A_{:l} − Σ_{r=1}^{t−1} ω_l^(r) ω^(r)
   c) γ^(t) = B^⊤ A_{:l} − Σ_{r=1}^{t−1} ω_l^(r) υ^(r)
   d) ω^(t) = δ^(t) / √(δ_l^(t)), υ^(t) = γ^(t) / √(δ_l^(t))
   e) Update f and g (Theorem 5).
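For concreteness, the following is an illustrative NumPy rendering of Algorithm 1 with the Theorem 5 updates. It assumes the `random_partition_sums` helper sketched above, defaults c to roughly √n (our choice, not the paper's), and omits the sparse-matrix optimizations mentioned in the text:

```python
import numpy as np

def greedy_feature_selection(A, k, c=None, seed=0):
    """A sketch of Algorithm 1 (partition-based greedy feature selection).
    A: (m, n) data matrix; k: number of features; c: number of random groups."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    c = c or max(1, int(np.sqrt(n)))

    # Step 1: random partitioning and the group-sum matrix B (m x c).
    B = random_partition_sums(A, c, rng)

    # Step 2: initial scores f_i = ||B' A_{:i}||^2 and g_i = A_{:i}' A_{:i}.
    BtA = B.T @ A                           # c x n, reused in every iteration
    f = np.sum(BtA ** 2, axis=0)
    g = np.sum(A ** 2, axis=0)

    omegas, upsilons, selected = [], [], []
    for _ in range(k):
        scores = f / np.maximum(g, 1e-12)
        scores[selected] = -np.inf
        l = int(np.argmax(scores))          # step (a)
        selected.append(l)

        # Steps (b)-(d): delta, gamma, omega, upsilon from A, B, and history.
        delta = A.T @ A[:, l] - sum(w[l] * w for w in omegas)
        gamma = BtA[:, l] - sum(w[l] * u for w, u in zip(omegas, upsilons))
        root = np.sqrt(delta[l])
        omega, upsilon = delta / root, gamma / root

        # Step (e): Theorem 5 updates of f and g.
        Hu = BtA.T @ upsilon - sum(w * (u @ upsilon) for w, u in zip(omegas, upsilons))
        f = f - 2.0 * omega * Hu + (upsilon @ upsilon) * omega ** 2
        g = g - omega ** 2

        omegas.append(omega)
        upsilons.append(upsilon)
    return selected
```

A call such as `selected = greedy_feature_selection(A, k=50)` returns the indices of the selected columns of A.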
VII. EXPERIMENTS AND RESULTS

Experiments have been conducted on four benchmark data sets, whose properties are summarized in Table I. These data sets were recently used by Cai et al. [11] to evaluate different feature selection methods in comparison to the Multi-Cluster Feature Selection (MCFS) method². In this section, seven methods for unsupervised feature selection are compared³:

1) PCA-LRG: a PCA-based method that selects the features associated with the first k principal components [1]. It has been shown by Masaeli et al. [3] that this method achieves a low reconstruction error of the data matrix compared to other PCA-based methods⁴.
2) FSFS: the Feature Selection using Feature Similarity method [13], with maximal information compression as the feature similarity measure.
3) LS: the Laplacian Score method [8].
4) SPEC: the spectral feature selection method [9], using all the eigenvectors of the graph Laplacian.
5) MCFS: the Multi-Cluster Feature Selection method [11], which has been shown to outperform other methods that preserve the cluster structure of the data.
6) Greedy: the basic greedy algorithm presented in this paper, using the recursive update formulas for f and g but without random partitioning.
7) Partition-based Greedy: the partition-based greedy algorithm (Algorithm 1).

² Data sets are available at http://www.zjucadcg.cn/dengcai/Data/FaceData.html and http://www.zjucadcg.cn/dengcai/Data/MLData.html
³ The following implementations were used: FSFS: http://www.facweb.iitkgp.ernet.in/~pabitra/paper/fsfs.tar.gz; LS: http://www.zjucadcg.cn/dengcai/Data/code/LaplacianScore.m; SPEC: http://featureselection.asu.edu/algorithms/fs_uns_spec.zip; MCFS: http://www.zjucadcg.cn/dengcai/Data/code/MCFS_p.m
⁴ The CPFS method was not included in the comparison as its implementation details were not completely specified in [3].

Similar to previous work [8], [11], the feature selection methods were compared based on their performance in clustering tasks. Two clustering algorithms were used to compare the different methods: the well-known k-means algorithm [15] and the state-of-the-art affinity propagation (AP) algorithm [16]. For each feature selection method, the k-means algorithm is applied to the rows of the data matrix whose columns are the subset of the selected features. For affinity propagation, a distance matrix is first calculated based on the selected subset of features, and then the algorithm is applied to the negative of this distance matrix. The preference vector, which controls the number of clusters, is set to the median of each column of the similarity matrix, as suggested by Frey and Dueck [16]. After the clustering is performed using the subset of selected features, the cluster labels are compared to ground-truth labels provided by human annotators, and the Normalized Mutual Information (NMI) [17] between the clustering labels and the class labels is calculated. The clustering performance with all features is also calculated and used as a baseline. In addition to clustering performance, the run times of the different feature selection methods are compared. This run time includes the time for selecting features only, and not the run time of the clustering algorithm.

Table I: THE PROPERTIES OF THE DATA SETS USED TO EVALUATE DIFFERENT FEATURE SELECTION METHODS [11].

Data set | # Instances | # Features | # Classes
ORL      | 400         | 1024       | 40
COIL20   | 1440        | 1024       | 20
ISOLET   | 1560        | 617        | 26
USPS     | 9298        | 256        | 10

Figures 1 and 2 show the clustering performance for the k-means and affinity propagation (AP) algorithms respectively⁵. It can be observed from the results that the greedy feature selection methods (the basic and partition-based variants) outperform the PCA-LRG, FSFS, LS, SPEC, and MCFS methods for almost all data sets. The basic greedy method outperforms MCFS for many data sets, while its partition-based variant outperforms MCFS for some data sets and shows comparable performance for the others. Figure 3 shows the run times of the different feature selection methods. It can be observed that FSFS is computationally more expensive than the other methods, as it depends on calculating complex similarities between features. The MCFS method, however efficient, is more computationally complex than the Laplacian score (LS) and the proposed greedy methods. It can also be observed that for data sets with a large number of instances (like USPS), the MCFS method, the Laplacian score (LS), and SPEC become very computationally demanding, as they depend on calculating pairwise similarities between instances. Figure 4 shows the run times of the PCA-LRG and Laplacian score (LS) methods in comparison to the proposed greedy methods. It can be observed that the complexity of the Laplacian score increases as the size of the data set increases. It can also be observed that the partition-based greedy feature selection is more efficient than the basic greedy feature selection.

⁵ The implementations of MCFS and AP do not scale to run on the USPS data set on the simulation machine used.

Figure 1. The k-means clustering performance (NMI %) of the different feature selection methods on the ORL, COIL20, ISOLET, and USPS data sets. (Plots not reproduced here.)
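The k-means part of this protocol can be reproduced with scikit-learn. The sketch below (our code, with a hypothetical `labels` vector holding the ground-truth classes) clusters on the selected columns and scores the result with NMI, mirroring the procedure described above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(A, selected, labels, n_clusters, seed=0):
    """Cluster the instances using only the selected features and score
    the result against the ground-truth labels with NMI."""
    X = A[:, selected]
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return normalized_mutual_info_score(labels, pred)
```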

Figure 2. The affinity propagation (AP) clustering performance (NMI %) of the different feature selection methods on the ORL, COIL20, and ISOLET data sets. (Plots not reproduced here.)

VIII. CONCLUSIONS

This paper presents a novel greedy algorithm for unsupervised feature selection. The algorithm optimizes a feature selection criterion which measures the reconstruction error of the data matrix based on the subset of selected features. The paper proposes a novel recursive formula for calculating the feature selection criterion, which is then employed to develop an efficient greedy algorithm for feature selection. In addition, two memory- and time-efficient variants of the feature selection algorithm are proposed. It has been empirically shown that the proposed algorithm achieves better clustering performance compared to state-of-the-art methods for feature selection, and is less computationally demanding than the methods that give comparable clustering performance.

REFERENCES

[1] I. Jolliffe, Principal Component Analysis, 2nd ed. Springer, 2002.
[2] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," J. Comput. Graph. Stat., vol. 15, no. 2, pp. 265-286, 2006.
[3] M. Masaeli, Y. Yan, Y. Cui, G. Fung, and J. Dy, "Convex principal feature selection," in Proceedings of the SIAM International Conference on Data Mining (SDM), 2010, pp. 619-628.
[4] Y. Cui and J. Dy, "Orthogonal principal feature selection," in the Sparse Optimization and Variable Selection Workshop at the International Conference on Machine Learning (ICML), 2008.
[5] Y. Lu, I. Cohen, X. Zhou, and Q. Tian, "Feature selection using principal feature analysis," in Proceedings of the 15th International Conference on Multimedia. New York, NY, USA: ACM, 2007, pp. 301-304.
[6] C. Boutsidis, M. W. Mahoney, and P. Drineas, "Unsupervised feature selection for principal components analysis," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). New York, NY, USA: ACM, 2008, pp. 61-69.
[7] C. Boutsidis, M. Mahoney, and P. Drineas, "Unsupervised feature selection for the k-means clustering problem," in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 153-161.
[8] X. He, D. Cai, and P. Niyogi, "Laplacian score for feature selection," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA, USA: MIT Press, 2006, pp. 507-514.
[9] Z. Zhao and H. Liu, "Spectral feature selection for supervised and unsupervised learning," in Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 1151-1157.
[10] L. Wolf and A. Shashua, "Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach," J. Mach. Learn. Res., vol. 6, pp. 1855-1887, 2005.
[11] D. Cai, C. Zhang, and X. He, "Unsupervised feature selection for multi-cluster data," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10). New York, NY, USA: ACM, 2010, pp. 333-342.
[12] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems 14 (NIPS '01). Cambridge, MA, USA: MIT Press, 2001, pp. 849-856.
[13] P. Mitra, C. Murthy, and S. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301-312, 2002.
[14] H. Lütkepohl, Handbook of Matrices. John Wiley & Sons Inc, 1996.
[15] A. Jain and R. Dubes, Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[16] B. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[17] A. Strehl and J. Ghosh, "Cluster ensembles: a knowledge reuse framework for combining multiple partitions," J. Mach. Learn. Res., vol. 3, pp. 583-617, 2003.

Figure 3. The run times (in seconds) of the different feature selection methods on the ORL, COIL20, ISOLET, and USPS data sets. (Plots not reproduced here.)

Figure 4. The run times of the PCA-LRG and LS methods in comparison to the proposed greedy algorithms. (Plots not reproduced here.)