Feature Selection via Correlation Coefficient Clustering

JOURNAL OF SOFTWARE, VOL. 5, NO. 12, DECEMBER 2010

Hui-Huang Hsu
Department of Computer Science and Information Engineering, Tamkang University, Taipei 25137, Taiwan
Email: h_hsu@mail.tku.edu.tw

Cheng-Wei Hsieh
Department of Computer Science and Information Engineering, Tamkang University, Taipei 25137, Taiwan
Email: 89190108@s9.tku.edu.tw

Abstract: Feature selection is a fundamental problem in machine learning and data mining. How to choose the most problem-related features from a set of collected features is essential. In this paper, a novel method that uses correlation coefficient clustering to remove similar/redundant features is proposed. The collected features are grouped into clusters by measuring their correlation coefficient values. The most class-dependent feature in each cluster is retained while the others in the same cluster are removed. Thus, the most class-related and mutually unrelated features are identified. The proposed method was applied to two datasets: the disordered protein dataset and the Arrhythmia (ARR) dataset. The experimental results show that the method is superior to other feature selection methods in speed and/or accuracy. Detailed discussions are given in the paper.

Index Terms: Feature Selection, Clustering, Correlation Coefficient, Support Vector Machines (SVMs), Machine Learning, Classification

I. INTRODUCTION

Feature selection aims to select the most problem-related features and to remove unnecessary features [1]. The unnecessary features include both noisy and redundant features. We can say that if a feature cannot help improve the classification accuracy, the feature is useless and unnecessary. A noisy feature, in particular, actively harms the classification results. If the classification result is improved by removing some features, we can say that those features could be noisy features. But one important question is how to find these noisy features. The wrapper-mode feature selection model could be helpful here [2].
However, it is usually very time-consuming, because it builds the selection around learning machines, which are the core of the feature selection loop [3][4]. Features which lower the overall accuracy of the learning machine are removed from the original feature set. The procedure is repeated until the classification accuracy cannot be further improved. This procedure needs complicated computation and always takes a lot of time. In this paper we focus on reducing repeated or redundant features. The targeted features may not be exactly the same, but they are closely related. Similar features fed into the classifier not only increase the computation time, but also decrease its classification capability. There are several measures which are helpful in finding redundant features. For example, mutual information, the correlation coefficient, and chi-square can be used to find the dependency between two features. However, for a large number of features, this pairwise dependency information is not enough to find the features which are close to each other in groups. Hence, clustering analysis is applied here. It is a very useful technique to divide a feature set into subsets within which the features are closely related to each other. If we can separate the collected features into such groups, we need to keep only one feature in each group because the members are almost the same. Therefore we can greatly reduce the number of features by removing the redundant ones. Clustering analysis usually uses the Euclidean distance as the similarity measurement. But measurements based on information theory could be more helpful in finding the dependency between two variables than simply measuring the distance in space. In this research, the correlation coefficient instead of the Euclidean distance is used for clustering analysis. The correlation coefficient of two random variables is a quantity that measures the mutual dependency of the two variables. Hence, when two features are mutually dependent, the occurrence and variation of the two features must be almost the same.
For a classification problem, we need to keep only one of them since they share almost the same characteristics. Among hundreds or even thousands of collected features, there must be features that are very similar to each other, and we can treat these features as the same kind of feature. We certainly do not need to use all features of the same kind for classification. After clustering analysis identifies all the different kinds of features, we can remove a great number of redundant features. The classification performance, in both computational speed and classification accuracy, can be improved by the removal of these redundant features. A novel feature selection algorithm based on the above-mentioned correlation coefficient clustering is proposed in this paper. Support vector machines (SVMs) [5] are used as the classifier for testing the feature selection results on two datasets: disordered protein data and Arrhythmia (ARR) data. Details are given in the subsequent sections. (doi: 10.4304/jsw.5.12.1371-1377)

The rest of this paper is organized as follows. Section II introduces related work. Section III presents the proposed clustering feature selection mechanism. Section IV describes the SVM learning model and the datasets. Section V shows experimental results and discussions. Finally, Section VI draws a brief conclusion.

II. RELATED WORK

Feature selection methods have been applied to classification problems in order to select a reduced feature set that makes the classifier faster and more accurate. Roughly speaking, feature selection models come in two different modes: filters and wrappers [2]. The filters measure the information content of features [6][7] (e.g., information gain) to decide the feature selection result. This kind of model works fast, but the classification result is not always satisfactory. Because the filters contain no error-rate-controlling technique, their results are not always stable. On the other hand, the wrappers combine a learning model within. The wrappers perform feature selection through two main steps: feature searching and classification error rate measurement. The feature searching procedure selects features from the original feature set and inputs them into the classification procedure to test their prediction error rate. The wrappers work slowly because both of these main steps are very time-consuming. Moreover, the complex calculation makes it difficult to apply the wrappers to applications with a large number of features. In our previous research, we combined the filters and the wrappers to handle applications with a large number of features [8]. First, we used fast filter models with two information measurements: information gain and F-score. These two models can filter out many features not that related to the problem. As mentioned above, the filter might not provide a satisfactory classification result.
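As an illustration of such a fast filter step, the sketch below ranks features by the F-score commonly used with SVM pipelines. This is a minimal sketch assuming NumPy and binary 0/1 labels; the function names are illustrative and the exact scoring variant used in [8] may differ.

```python
import numpy as np

def f_score(X, y):
    """Per-feature F-score for a binary problem (labels in {0, 1}).

    Higher scores indicate better separation of the two classes;
    noisy features tend to score low.
    """
    pos, neg = X[y == 1], X[y == 0]
    m, mp, mn = X.mean(0), pos.mean(0), neg.mean(0)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

def top_features(X, y, t):
    """Keep the indices of the t highest-scoring features (the filter step)."""
    return np.argsort(f_score(X, y))[::-1][:t]
```

A filter like this only ranks features individually, which is exactly why, as discussed above, it cannot detect redundancy between features.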
Hence, we performed wrapper-mode feature selection to improve the classifier's prediction result. The hybrid mechanism was applied to the protein disordered region prediction problem, which is to find the unstructured regions of proteins. The learning model used in it was the support vector machine. In the experimental results, 350 features were selected from the original 440 features and the prediction accuracy was 82.72%. One way to address the problem of redundant or repeated features is to use some kind of feature dependency measurement, such as mutual information (MI), the correlation coefficient, or chi-square. A mutual information feature selection mechanism was proposed by Huang et al. [9]. They used a filter approach to perform the feature selection. In their point of view, there are two types of input features perceived as being unnecessary: features completely irrelevant to the output classes, and features redundant given other input features. Feature selection is done by computing mutual information between features and classes and among the features themselves. The concept is from information theory, which analyzes the relationship between features and classes to remove the redundant features and the features most irrelevant to the class. Another feature-dependency-based feature selection method was proposed by Peng et al. [10]. They also used mutual information to perform feature selection. Their original feature selection concept is based on the features' max-dependency (MaxDep) [11], which measures a feature set's statistical dependency with the target class. MaxDep selects m features that jointly have the largest dependency on the target class. The final selected features have the maximal dependency values, calculated from some similarity measurement, for example, the correlation coefficient or mutual information. However, the estimation of MaxDep is very hard due to its multivariate dependency measurement, which must be retrieved from a high-dimensional space. Both feature searching and information measuring are quite time-consuming.
In order to improve MaxDep, Peng et al. designed a two-stage feature selection algorithm combining the minimal-redundancy-maximal-relevance criterion (mRMR) with other more sophisticated feature selectors. It selects the features with the maximal class-related value such that each selected feature has minimal redundancy with all the already selected features. It then performs optimal first-order incremental selection to improve the classification result. By using some wrapper kind of feature selection model (e.g., forward/backward floating search), they get the final compact feature set with the highest classification accuracy. The results confirm that mRMR leads to promising improvements in feature selection and classification accuracy. Among feature dependency measurement techniques, the correlation coefficient also plays an important role, though it has not been used as often as mutual information. By definition, the correlation coefficient provides a quantitative measurement that represents the strength of a linear relationship between two sequences of observations. Hence, for most tests of the relationship between variables, calculating correlation coefficients would be the first step to determine if they are linearly dependent. On the other hand, mutual information is based on knowledge measurement, which handles the test of how much knowledge one can gain about a certain variable by knowing the value of another variable. Mutual information helps reduce the range of the probability density function for a random variable x if the variable y is known. Therefore, if we only want to test the dependency between two variables instead of testing the knowledge gain, it is preferable to use the correlation coefficient. In the next section, we introduce the correlation coefficient based feature selection model, which can find redundant features by testing pairwise feature dependency.

III. CORRELATION COEFFICIENT CLUSTERING FOR FEATURE SELECTION

Finding related feature groups is not an easy task.
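As a quick illustration of this first step, the sketch below computes pairwise Pearson correlation coefficients and flags a strongly dependent feature pair. It is a minimal sketch assuming NumPy; the feature values are invented for illustration.

```python
import numpy as np

# Three illustrative features over six observations: f2 is a linear
# rescaling of f1 (redundant), while f3 varies independently of f1.
f1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f2 = 3.0 * f1 + 10.0          # perfectly linearly dependent on f1
f3 = np.array([2.0, -1.0, 4.0, 0.0, 3.0, -2.0])

R = np.corrcoef(np.vstack([f1, f2, f3]))  # 3x3 correlation matrix

print(abs(R[0, 1]))   # |r(f1, f2)| is (numerically) 1: redundant pair
print(abs(R[0, 2]))   # |r(f1, f3)| is far from 1: keep both features
```

Strong dependency shows up as |r| close to 1, regardless of sign, which is why the method below works with the absolute value of r.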
Pairwise similarity measurement over the whole feature set is hard to realize because of the huge amount of calculation involved. Besides, the result of pairwise measurements cannot be used to identify multiple similar features. Thus we propose to use clustering analysis to group the most related features together. This divides the feature set into groups of multiple features. The Euclidean distance is the most used similarity measurement in clustering analysis. However, it does not fit our feature selection goal. Therefore, we replace the distance measurement with the correlation coefficient in clustering. Next, feature selection within the feature clusters is also an important part of the procedure. One representative feature needs to be picked from each feature cluster. In previous research, little attention was paid to this problem. Researchers thought that since the features in the same cluster are almost the same, any of them can be chosen and the classification results would be about the same. But there are differences among those similar features. Here we propose to choose the feature most related to the class in each feature cluster: the feature that has the highest correlation coefficient value with the class label is picked. The following subsections introduce the clustering mechanism, the correlation coefficient, and the proposed correlation coefficient clustering algorithm for feature selection.

A. Clustering

Clustering is one of the most widely used techniques for exploratory data analysis. It can also be considered the most important unsupervised learning problem. Practically, clustering analysis finds a structure in a collection of unlabeled data. Hence, it separates the original dataset into smaller datasets called clusters. Data in each cluster are close to each other. Fig. 1 demonstrates such a separation of data.

Figure 1. Separation of data via clustering (Cluster 1, Cluster 2, Cluster 3).

Clustering algorithms can be classified as hierarchical clustering, overlapping clustering, exclusive clustering, and probabilistic clustering [12].
In our research, we only consider exclusive clustering, which means each node in Fig. 1 can belong to only one cluster. There are also many clustering algorithms. Among them, K-means is the classical one. K-means clustering separates n observations into k clusters, with each observation belonging to the cluster of the nearest mean. Usually the Euclidean distance is used as the distance metric to calculate the relationships between observations. K-means clustering works as follows:

1. Randomly select k nodes as the means from the n observations, where k ≤ n.
2. Calculate the Euclidean distance from each node to all the means; each of the (n-k) remaining observations is assigned to its nearest mean.
3. Re-calculate the means of all clusters m_1, m_2, ..., m_k.
4. Repeat Steps 2 and 3 until the content of each cluster is fixed.

Finally, each cluster represents a collection distinct from the other clusters. By using this kind of clustering model, the observations can easily be separated according to the Euclidean distance measurement. Considering the time complexity, this is much better than measuring the distance between every pair of observations. However, the Euclidean distance can only measure the spatial distance between observations. The dependency between observations cannot be revealed. Hence, in this paper, we apply the correlation coefficient in clustering to measure the dependency of all observations.

B. Correlation Coefficient

In statistics, the correlation coefficient indicates the strength and direction of a relationship between two random variables. The commonest use refers to a linear relationship. In general statistical usage, correlation (or co-relation) refers to the departure of two random variables from independence. Equation (1) shows the calculation of the correlation coefficient between two variables x and y over n observations:

r_xy = (Σ_{i=1}^n x_i y_i − n·x̄·ȳ) / ( √(Σ_{i=1}^n x_i² − n·x̄²) · √(Σ_{i=1}^n y_i² − n·ȳ²) )    (1)

Two variables have strong dependency when their correlation coefficient value is close to 1 or -1. When the value is 0, it means that the two variables are not related at all. In our research, strong dependency is what we are looking for, whether it is positive or negative. Therefore, in the measurement procedure, the absolute value of the correlation coefficient r is used.

C. Correlation Coefficient Clustering Algorithm

In this study, we combine the correlation coefficient with clustering analysis for feature selection. Instead of the Euclidean distance, we choose the correlation coefficient as the similarity measurement, as discussed in the previous subsection. Moreover, clustering analysis can separate the whole feature set into different groups. Closely related features are put together by the clustering steps. The features are divided into different kinds of groups according to their dependency, and each group represents a part of the feature space. For the final goal of feature selection, we must choose the most relevant and non-redundant features from the original feature set to reduce the number of features. In this approach, only one feature is needed from each kind/cluster of features. The reason is that features in the same cluster are very close to each other and we do not need more than one feature of the same kind to perform the classification task. Fig. 2 shows the concept of the proposed feature selection model. In the clustering procedure, we use the correlation coefficient as the similarity measurement to check the dependency among features.

Figure 2. The process of correlation coefficient clustering feature selection: the whole feature set a_1, a_2, a_3, a_4, ..., a_n is grouped into clusters c_1, c_2, ..., c_k by correlation coefficient clustering of similar features; feature selection from the k clusters yields m_1, m_2, ..., m_k after removal of the redundant features. The remaining features are the result of feature selection.

A question comes up here regarding how to pick the representative feature for each feature cluster. That is, which feature in a cluster should we keep? We propose to pick the most class-dependent feature in each cluster as the representative one. The correlation coefficient can also be used to decide the class-feature dependency. The most class-dependent features from all clusters can certainly help improve the overall classification accuracy. The pseudocode of the proposed correlation coefficient clustering feature selection algorithm is as follows.

Randomly select k nodes m (m_1, ..., m_k) from the n observations a (a_1, ..., a_n);
WHILE originally selected k nodes m (m_1, ..., m_k) != newly selected k nodes m' (m'_1, ..., m'_k)
  FOR i = 1 to n (observations)
    FOR j = 1 to k (nodes)
      r_ij = Correlation_Coefficient(a_i, m_j);
    IF r_ij == MAX(r_i1, r_i2, ..., r_ik)
      a_i belongs to m_j's cluster;
    END IF
  FOR p = 1 to k (clusters C_1, ..., C_k)
    FOR q = 1 to C_p's length t (cluster p's contents s_1, ..., s_t)
      r_q = Correlation_Coefficient(s_q, class labels);
    IF r_q == MAX(r_1, r_2, ..., r_t)
      m_p = s_q;
    END IF
END WHILE
RETURN m (m_1, ..., m_k);

Next, we make a brief comparison of the proposed method with mRMR. First, mRMR only chooses the most informational features, i.e., the most class-related features.
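The algorithm above can be sketched in Python as follows. This is a minimal sketch assuming NumPy; the data layout, function names, empty-cluster handling, and tie-breaking are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def corr(u, v):
    """Absolute value of the Pearson correlation coefficient of two sequences."""
    r = np.corrcoef(u, v)[0, 1]
    return 0.0 if np.isnan(r) else abs(r)

def cc_cluster_select(X, y, k, init=None, seed=0, max_iter=50):
    """Correlation coefficient clustering feature selection (sketch).

    X: (n_samples, n_features) data matrix; y: class labels.
    Each feature joins the cluster of the representative it is most
    correlated with; each cluster's representative is then replaced by
    its member most correlated with the class labels, and the two steps
    repeat until the representative set stops changing.
    """
    n_feat = X.shape[1]
    if init is None:
        rng = np.random.default_rng(seed)
        init = rng.choice(n_feat, size=k, replace=False)
    reps = list(init)
    for _ in range(max_iter):
        # Assign every feature to the representative it correlates with most.
        clusters = [[] for _ in range(k)]
        for f in range(n_feat):
            j = max(range(k), key=lambda j: corr(X[:, f], X[:, reps[j]]))
            clusters[j].append(f)
        # New representative: the cluster member most correlated with the class.
        new_reps = [max(members, key=lambda f: corr(X[:, f], y)) if members else reps[j]
                    for j, members in enumerate(clusters)]
        if new_reps == reps:
            break
        reps = new_reps
    return sorted(reps)
```

With two redundant feature pairs, the sketch keeps exactly one feature from each pair, which is the intended behavior of the algorithm.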
As we know, the m best features are not the best m features [13], so the result of mRMR might ignore features which are not so closely related to the class label but can complement other features to improve the classification result. In the proposed method, no such features would be missed. Secondly, the Min-Redundancy step of mRMR only randomly keeps one of the Max-Relevance features. In contrast, the proposed method retains the most class-related feature in each feature cluster by calculating the correlation coefficients between the features and the class. The other features in the same cluster are then removed.

IV. LEARNING MODEL AND DATASETS

A machine learning method is needed when we apply the proposed feature selection to classification problems. The support vector machine (SVM) was chosen for the experiments in this research due to its advantages in the use of kernels for nonlinear problems and the optimization of the separating margins. Furthermore, it can avoid local minima problems during the training process. In this section, the datasets used for the experiments in this research are also introduced.

A. Support Vector Machine

The SVM is based on SV (support vector) learning. That means the SVM does not compare the prediction target with all the existing training nodes. Instead, the SVM selects a group of nodes as its SVs, and uses these SVs to judge the label of the classification target. In the testing stage, the SVM model uses the SVs to do the classification. These SVs are located near the hyperplanes that give the maximum margin of class separation. Fig. 3 demonstrates the maximum margin between two classes separated by the hyperplane in the SVM model. H1 and H2 are the boundaries, and the nodes located near these two lines are the support vectors.

Figure 3. The SVM finds the maximum margin and uses the SVs to classify the prediction targets. Boundaries H1 and H2 are located on these SVs.
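The experiments in this paper use LIBSVM [5] with a nonlinear kernel; as a self-contained illustration of the maximum-margin idea only, the sketch below trains a simplified linear SVM by full-batch subgradient descent on the regularized hinge loss. All names and hyperparameters here are illustrative assumptions, not the authors' setup.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2*||w||^2 + mean(hinge loss) by subgradient descent.

    X: (n, d) inputs; y: labels in {-1, +1}. Returns (w, b).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # only margin violators contribute to the hinge subgradient
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Classify by which side of the separating hyperplane a point falls on."""
    return np.sign(X @ w + b)
```

The regularization term is what pushes the separating hyperplane toward the maximum-margin solution; in the kernelized SVM used in the paper, the same trade-off is optimized in the dual with the RBF kernel.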

B. Datasets

Protein disordered region prediction is the first problem tried in this research. In proteomics, a protein's function is always strongly related to its structure. While some parts of a protein have a fixed, definite structure, such as the α-helix, β-sheet, or coil, other parts are not associated with well-defined conformations. Previously, these so-called disordered regions were not thought to have a specific function of their own. But recent studies suggest that some disordered regions may have important signaling or regulatory functions. In addition, some critical diseases are strongly related to these disordered regions. Thus, protein disordered region prediction is an important problem. However, the most relevant features for this problem are yet to be determined [14][15]. Our ordered and disordered sequences were collected from the PDB [16] and DisProt [17] databases. The proteins in DisProt all contain disordered regions. The protein sequences collected from PDB contain mostly ordered regions. The data selected from DisProt are taken as positive training data, and the negative training data are derived from PDB_Select_25 [18], which is a non-redundant dataset of the Protein Data Bank (PDB). Finally, 119 protein sequences with 440 features were collected, with a total of 1676 residues. The 440 features were determined from related research [8]. In order to compare the proposed method with MaxDep and mRMR, the Arrhythmia (ARR) dataset from the UCI machine learning archive [19] was also used. The aim of this dataset is to distinguish between the absence and presence of cardiac arrhythmia and to classify a datum into one of 16 classes. Here, however, we only consider two states: normal and abnormal. Class 1 refers to normal, Classes 2 to 15 refer to different abnormal classes of arrhythmia, and Class 16 refers to the other, unclassified ones. In this dataset, there are in total 452 instances with 279 features. Among the features, 206 are linear values and the rest are nominal.

V. EXPERIMENTAL RESULTS

A software tool has been implemented for the proposed feature selection method (Fig. 4). C#.NET in MS Visual Studio was used to develop the tool. The user can determine the number of clusters, the similarity measurement, and the clustering method in our tool. As for the determination of the number of clusters in the experiments, several methods have been tried, namely, the gap statistic [20], the Calinski-Harabasz index [21], the Krzanowski-Lai index [22], and the Hartigan statistic [23]. Most of them compare the values of between-cluster sums of squares and the values of within-cluster sums of squares to detect the distribution of the data. Following that distribution, the number of clusters can be estimated. There are two main problems. First, these methods can only give estimates and are sometimes not very precise. Secondly, in our experiments, the number of clusters is also the final number of remaining features. According to past research, with only the main class-related features the classifier might not perform well. Sometimes it is necessary to include some additional features to improve the classifier's discrimination ability. Therefore, in our experiments, although we had the estimated number of clusters from these models, we still tried several different numbers of clusters.

Figure 4. The interface of the feature selection software tool.

For the SVM learning machine in this experiment, we use the RBF kernel. The experimental results of protein disordered region prediction with the proposed method are listed in TABLE 1. There are in total 440 features in the original dataset. The best result via five-fold cross-validation is 86.30% with only 200 features. It is much better than the result produced by our previous work with a hybrid feature selection model [8]. The best result in [8] was 82.72% with 350 features. The number of features is further reduced by 34% ((350-200)/440) and the classification accuracy is raised by 3.58%. This demonstrates the usefulness of the proposed feature selection method.

TABLE 1. FIVE-FOLD CROSS-VALIDATION RESULTS ON DISORDERED PROTEIN DATA

Feature number | Accuracy (5-fold cross-validation)
50             | 82.8%
100            | 84.00%
150            | 85.67%
200            | 86.30%

Next, we compare the proposed method with mRMR and MaxDep [10] on the ARR dataset. Fig. 5 shows that the proposed method is better than MaxDep and comparable to mRMR in classification accuracy. The number of selected features ranges from 5 to 55 (from the original 279 features). The proposed method provides a better and more stable result than MaxDep. In the procedure of feature searching, MaxDep has to search through the whole feature set with different combinations, and this procedure also takes time. The proposed method did not perform better than mRMR. The reason is that mRMR incorporates the wrapper mode in the second stage of its feature selection procedure. The wrapper mode works as a post-modification step which can further improve the classification accuracy by repeatedly using a learning machine. This repeated process is very time-consuming. On the other hand, our method only uses clustering analysis once. It is more like a filter-mode feature selection procedure that does not require very complex calculations.

Figure 5. Ten-fold cross-validation accuracy comparison among MaxDep, mRMR, and correlation coefficient clustering feature selection on Arrhythmia data (learning machine: SVM).

From the experimental results, we can observe that the number of features can be greatly reduced by the proposed method on both datasets. The advantage of the proposed method is that it executes much faster than the wrapper-mode feature selection methods while maintaining comparable classification accuracy. Clustering analysis is very helpful in finding maximal dependency among features. Each cluster can represent a different kind of features.

VI. CONCLUSION

In this paper, a novel feature selection method is proposed. The key characteristic of the method is to apply clustering analysis to group the collected features. Only one representative feature is needed from each feature group. This can greatly reduce the total number of features. In the method, the correlation coefficient is used to find similar features with maximum dependency. It is also used to identify the most class-dependent feature as the representative feature in each feature cluster. Filter-mode feature selection methods only focus on identifying the most class-related features without considering the redundancy among those features. Also, some removed features are actually helpful to the overall classification performance, but are viewed as not so class-related and removed just because their measures are low. On the other hand, feature selection methods involving the wrapper mode require a lot of computation. The proposed method is advantageous compared with both filter-mode and wrapper-mode methods.
This method does not yet consider the removal of noisy features, which can be harmful to the overall performance. One simple way to identify possibly noisy data is to look for representative features that have a low correlation coefficient value with the class. A representative feature with a near-zero correlation coefficient value should definitely be removed, but experiments are needed to carefully examine the threshold setting. This is one future direction of this research.

REFERENCES

[1] C. Deisy, B. Subbulakshmi, S. Baskar, and N. Ramaraj, Efficient Dimensionality Reduction Approaches for Feature Selection, International Conference on Computational Intelligence and Multimedia Applications, vol. 2, pp. 11-17, 2007. [doi: 10.1109/ICCIMA.2007.88]
[2] R. Kohavi and G. John, Wrappers for Feature Subset Selection, Artificial Intelligence, vol. 97, pp. 273-324, 1997. [doi: 10.1016/S0004-3702(97)00043-X]
[3] J. R. Quinlan, Discovering Rules from Large Collections of Examples: A Case Study, in D. Michie, ed., Expert Systems in the Microelectronic Age, Edinburgh: Edinburgh University Press, 1979, pp. 168-201.
[4] Y. Lu, Y. F. Yin, J. J. Gao, and C. G. Tan, Wrapper Feature Selection Optimized SVM Model for Demand Forecasting, The International Conference for Young Computer Scientists, pp. 953-958, 2008. [doi: 10.1109/ICYCS.2008.151]
[5] LIBSVM - A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (last accessed Nov 3, 2009)
[6] A. Al-Ani, A dependency-based search strategy for feature selection, Expert Systems with Applications: An International Journal, vol. 36, pp. 12392-12398, 2009.
[7] B. Bonev, F. Escolano, and M. Angel-Cazorla, A Novel Information Theory Method for Filter Feature Selection, MICAI 2007: Advances in Artificial Intelligence, Springer Berlin / Heidelberg, pp. 431-440, 2007.
[8] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, A Hybrid Feature Selection Mechanism, in Proc. Eighth International Conference on Intelligent Systems Design and Applications (ISDA 2008), vol. 2, pp. 271-276, Kaohsiung, Taiwan, Nov. 26-28, 2008. [doi: 10.1109/ISDA.2008.80]
[9] J. J.
Huang, Y. Z. Cai, and X. M. Xu, A Filter Approach to Feature Selection Based on Mutual Information, Cognitive Informatics, vol. 1, pp. 84-89, 2006. [doi: 10.1109/COGINF.2006.365681]
[10] H. C. Peng, F. H. Long, and C. Ding, Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005. [doi: 10.1109/TPAMI.2005.159]
[11] C. Ding and H. C. Peng, Minimum Redundancy Feature Selection from Microarray Gene Expression Data, Proc. Second IEEE Computational Systems Bioinformatics Conf., pp. 523-528, 2003. [doi: 10.1109/CSB.2003.1227396]
[12] M. Matteucci, A Tutorial on Clustering Algorithms, http://home.dei.polimi.it/matteucci/clustering/tutorial_html/ (last accessed Feb 3, 2010)
[13] T. M. Cover, The Best Two Independent Measurements Are Not the Two Best, IEEE Trans. Systems, Man, and Cybernetics, vol. 4, pp. 116-117, 1974.
[14] C. Bracken, L. M. Iakoucheva, P. R. Romero, and A. K. Dunker, Combining prediction, computation and

experiment for the characterization of protein disorder, Curr. Opin. Struct. Biol., vol. 14, pp. 570-576, 2004.
[15] K. Peng, P. Radivojac, S. Vucetic, A. K. Dunker, and Z. Obradovic, Length-dependent prediction of protein intrinsic disorder, BMC Bioinformatics, vol. 7, p. 208, 2006. [doi: 10.1186/1471-2105-7-208]
[16] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig et al., The Protein Data Bank, Nucleic Acids Research, vol. 28, pp. 235-242, 2000. [doi: 10.1107/S0907444902003451]
[17] S. Vucetic, Z. Obradovic, V. Vacic, P. Radivojac, K. Peng, L. M. Iakoucheva et al., DisProt: A Database of Protein Disorder, Bioinformatics, vol. 21, pp. 137-140, 2005. [doi: 10.1093/bioinformatics/bth476]
[18] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, Basic local alignment search tool, J. Mol. Biol., vol. 215, pp. 403-410, 1990. [doi: 10.1006/jmbi.1990.9999]
[19] UCI machine learning repository, http://www.ics.uci.edu/mlearn/mlsummary.html (last accessed Feb 3, 2010)
[20] R. Tibshirani, G. Walther, and T. Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society, Series B 63, pp. 411-423, 2001.
[21] R. B. Calinski and J. A. Harabasz, A dendrite method for cluster analysis, Communications in Statistics, vol. 3, pp. 1-27, 1974.
[22] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley, 1990.
[23] J. A. Hartigan, Clustering Algorithms. Wiley, 1975.

Hui-Huang Hsu is an Associate Professor in the Department of Computer Science and Information Engineering at Tamkang University, Taipei, Taiwan. He received both his PhD and MS degrees from the Department of Electrical and Computer Engineering at the University of Florida, USA, in 1994 and 1991, respectively. He has published over 80 refereed papers and book chapters, and has participated in many international academic activities. His current research interests are in the areas of machine learning, data mining, biomedical informatics, ambient intelligence, and multimedia processing.
He is a senior member of the IEEE.

Cheng-Wei Hsieh received his master's degree in Computer Science and Information Engineering at National Central University. His MS degree is from the Department of Computer Science & Information Engineering at Tamkang University, Taipei, Taiwan, where he is now a PhD candidate. His major research interests include machine learning and its applications in bioinformatics.