Data Visualization and Feature Selection: New Algorithms for Nongaussian Data




Howard Hua Yang and John Moody
Oregon Graduate Institute of Science and Technology
NW Walker Rd., Beaverton, OR 976, USA
hyang@ece.ogi.edu, moody@cse.ogi.edu, FAX: 5 74846

Abstract

Data visualization and feature selection methods are proposed based on the joint mutual information and ICA. The visualization methods can find many good 2-D projections for high dimensional data interpretation, which cannot be easily found by the other existing methods. The new variable selection method is found to be better in eliminating redundancy in the inputs than other methods based on simple mutual information. The efficacy of the methods is illustrated on a radar signal analysis problem to find 2-D viewing coordinates for data visualization and to select inputs for a neural network classifier.

Keywords: feature selection, joint mutual information, ICA, visualization, classification.

1 INTRODUCTION

Visualization of input data and feature selection are intimately related. A good feature selection algorithm can identify meaningful coordinate projections for low dimensional data visualization. Conversely, a good visualization technique can suggest meaningful features to include in a model.

Input variable selection is the most important step in the model selection process. Given a target variable, a set of input variables can be selected as explanatory variables by some prior knowledge. However, many irrelevant input variables cannot be ruled out by prior knowledge. Too many input variables irrelevant to the target variable will not only severely complicate the model selection/estimation process but also damage the performance of the final model.

Selecting input variables after model specification is a model-dependent approach [6]. However, these methods can be very slow if the model space is large. To reduce the computational burden in the estimation and selection processes, we need model-independent approaches to select input variables before model specification. One such approach is the δ-test [7]. Other approaches are based on the mutual information (MI) [2, 3, 4], which is very effective in evaluating the relevance of each input variable but fails to eliminate redundant variables.

In this paper, we focus on the model-independent approach to input variable selection based on the joint mutual information (JMI).

The increment from MI to JMI is the conditional MI. Although the conditional MI was used in [4] to show the monotonic property of the MI, it was not used for input selection.

Data visualization is very important for humans to understand the structural relations among variables in a system. It is also a critical step in eliminating some unrealistic models. We give two methods for data visualization. One is based on the JMI and the other is based on Independent Component Analysis (ICA). Both methods perform better than some existing methods, such as those based on PCA and canonical correlation analysis (CCA), for nongaussian data.

2 Joint mutual information for input/feature selection

Let Y be a target variable and let the X_i's be inputs. The relevance of a single input is measured by the MI

    I(X_i; Y) = K(p(x_i, y) \| p(x_i) p(y)),

where K(p \| q) is the Kullback-Leibler divergence of two probability functions p and q, defined by

    K(p(x) \| q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}.

The relevance of a set of inputs is defined by the joint mutual information

    I(X_1, \ldots, X_k; Y) = K(p(x_1, \ldots, x_k, y) \| p(x_1, \ldots, x_k) p(y)).

Given two selected inputs X_j and X_k, the conditional MI is defined by

    I(X_i; Y | X_j, X_k) = \sum_{x_j, x_k} p(x_j, x_k) K(p(x_i, y | x_j, x_k) \| p(x_i | x_j, x_k) p(y | x_j, x_k)),

and I(X_i; Y | X_j, \ldots, X_k), conditioned on more than two variables, is defined similarly. The conditional MI is always non-negative since it is a weighted average of Kullback-Leibler divergences. It has the following property:

    I(X_1, \ldots, X_{n-1}, X_n; Y) - I(X_1, \ldots, X_{n-1}; Y) = I(X_n; Y | X_1, \ldots, X_{n-1}) \ge 0.    (1)

Therefore, I(X_1, ..., X_{n-1}, X_n; Y) ≥ I(X_1, ..., X_{n-1}; Y), i.e., adding the variable X_n never decreases the mutual information. The information gained by adding a variable is measured by the conditional MI. When X_n and Y are conditionally independent given X_1, ..., X_{n-1}, the conditional MI between X_n and Y is zero, so X_n provides no extra information about Y when X_1, ..., X_{n-1} are known. In particular, when X_n is a function of X_1, ..., X_{n-1}, equality holds in (1). This is the reason why the joint MI can be used to eliminate redundant inputs.

The conditional MI is useful when the input variables cannot be distinguished by the mutual information I(X_i; Y). For example, assume I(X_1; Y) = I(X_2; Y) = I(X_3; Y), and the problem is to select (X_1, X_2), (X_1, X_3) or (X_2, X_3). Since

    I(X_1, X_3; Y) - I(X_1, X_2; Y) = I(X_3; Y | X_1) - I(X_2; Y | X_1),

we should choose (X_1, X_3) rather than (X_1, X_2) if I(X_3; Y | X_1) > I(X_2; Y | X_1). Otherwise, we should choose (X_1, X_2). All possible comparisons are represented by a binary tree in Figure 1.

To estimate I(X_1, ..., X_k; Y), we need to estimate the joint probability p(x_1, ..., x_k, y). This suffers from the curse of dimensionality when k is large.
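For concreteness, the following sketch (our illustration, not code from the paper) shows how these quantities can be estimated from data with simple histogram counts; the function names, the use of numpy, and the fixed number of bins are our own assumptions.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a histogram given as an array of counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(x, y, bins=10):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

def joint_mutual_info(x1, x2, y, bins=10):
    """I(X1,X2;Y) = H(X1,X2) + H(Y) - H(X1,X2,Y), from a 3-D histogram."""
    joint, _ = np.histogramdd(np.column_stack([x1, x2, y]), bins=bins)
    return entropy(joint.sum(axis=2)) + entropy(joint.sum(axis=(0, 1))) - entropy(joint)

def conditional_mutual_info(x2, y, x1, bins=10):
    """I(X2;Y|X1) = I(X1,X2;Y) - I(X1;Y), the 'increment' from MI to JMI."""
    return joint_mutual_info(x1, x2, y, bins) - mutual_info(x1, y, bins)
```

The conditional MI is obtained here through the same identity I(X_1, X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1) used throughout the paper.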

Sometimes, we may not be able to estimate a high dimensional MI due to sample shortage. Further work is needed to estimate high dimensional joint MI based on parametric and non-parametric density estimation when the sample size is not large enough. In some real world problems, such as mining large databases and radar pulse classification, the sample size is large. Since the parametric densities for the underlying distributions are unknown, it is better to use non-parametric methods such as histograms to estimate the joint probability and the joint MI, to avoid the risk of specifying a wrong or overly complicated model for the true density function.

Figure 1: Input selection based on the conditional MI.

In this paper, we use the joint mutual information I(X_i, X_j; Y) instead of the mutual information I(X_i; Y) to select inputs for a neural network classifier. Another application is to select the two inputs most relevant to the target variable for data visualization.

3 Data visualization methods

We present supervised data visualization methods based on the joint MI and discuss unsupervised methods based on ICA.

The most natural way to visualize high-dimensional input patterns is to display them using two of the existing coordinates, where each coordinate corresponds to one input variable. Those inputs which are most relevant to the target variable correspond to the best coordinates for data visualization. Let

    (i^*, j^*) = \arg\max_{(i,j)} I(X_i, X_j; Y).

Then the coordinate axes (X_{i*}, X_{j*}) should be used for visualizing the input patterns, since the corresponding inputs achieve the maximum joint MI. To find the maximum I(X_{i*}, X_{j*}; Y), we need to evaluate every joint MI I(X_i, X_j; Y) for i < j. The number of evaluations is O(n^2). Noticing that

    I(X_i, X_j; Y) = I(X_i; Y) + I(X_j; Y | X_i),

we can first maximize the MI I(X_i; Y), then maximize the conditional MI I(X_j; Y | X_i). This algorithm is suboptimal, but only requires O(n) evaluations of the joint MIs. Sometimes, this is equivalent to exhaustive search. One such example is given in the next section.

Some existing methods to visualize high-dimensional patterns are based on dimensionality reduction methods such as PCA and CCA to find new coordinates to display the data. The new coordinates found by PCA and CCA are orthogonal in Euclidean space and in the space with the Mahalanobis inner product, respectively. However, these two methods are not suitable for visualizing nongaussian data because the projections on the PCA or CCA coordinates are not statistically independent for nongaussian vectors. Since the JMI method is model-independent, it is better for analyzing nongaussian data.
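The two search strategies just described can be sketched as follows (again our illustration, reusing the histogram estimators above; X is assumed to be an N x n array of inputs and y the target):

```python
from itertools import combinations

def best_pair_exhaustive(X, y, bins=10):
    """Maximize I(Xi,Xj;Y) over all pairs i < j: O(n^2) joint-MI evaluations."""
    return max(combinations(range(X.shape[1]), 2),
               key=lambda ij: joint_mutual_info(X[:, ij[0]], X[:, ij[1]], y, bins))

def best_pair_greedy(X, y, bins=10):
    """Suboptimal O(n) search: i1 = argmax I(Xi;Y), then j1 = argmax I(Xj;Y|Xi1)."""
    n = X.shape[1]
    i1 = max(range(n), key=lambda i: mutual_info(X[:, i], y, bins))
    j1 = max((j for j in range(n) if j != i1),
             key=lambda j: conditional_mutual_info(X[:, j], y, X[:, i1], bins))
    return i1, j1
```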

Both CCA and maximum joint MI are supervised methods, while the PCA method is unsupervised. An alternative to these methods is ICA for visualizing clusters [5]. ICA is a technique to transform a set of variables into a new set of variables so that the statistical dependency among the transformed variables is minimized. The version of ICA that we use here is based on the algorithms in [1, 8]. It discovers a non-orthogonal basis that minimizes the mutual information between the projections on the basis vectors. We shall compare these methods in a real world application.

4 Application to Signal Visualization and Classification

4.1 Joint mutual information and visualization of radar pulse patterns

Our goal is to design a classifier for radar pulse recognition. Each radar pulse pattern is a 15-dimensional vector. We first compute the joint MIs, then use them to select inputs for the visualization and classification of radar pulse patterns.

A set of radar pulse patterns is denoted by D = {(z^i, y^i) : i = 1, ..., N}, which consists of patterns in three different classes. Here, each z^i ∈ R^15 and each y^i ∈ {1, 2, 3}.

Figure 2: (a) MI vs conditional MI for the radar pulse data; maximizing the MI and then the conditional MI with O(n) evaluations gives I(X_{i_1}, X_{j_1}; Y). (b) The joint MI for the radar pulse data; maximizing the joint MI gives I(X_{i*}, X_{j*}; Y) with O(n^2) evaluations of the joint MI. (i_1, j_1) = (i*, j*) in this case.

Let i_1 = arg max_i I(X_i; Y) and j_1 = arg max_{j ≠ i_1} I(X_j; Y | X_{i_1}). From Figure 2(a), we obtain (i_1, j_1), with j_1 = 9, and I(X_{i_1}, X_{j_1}; Y) = I(X_{i_1}; Y) + I(X_{j_1}; Y | X_{i_1}). If the number of total inputs is n, then the number of evaluations for computing the mutual information I(X_i; Y) and the conditional mutual information I(X_j; Y | X_{i_1}) is O(n).

To find the maximum I(X_{i*}, X_{j*}; Y), we evaluate every I(X_i, X_j; Y) for i < j. These MIs are shown by the bars in Figure 2(b), where the i-th bundle displays the MIs I(X_i, X_j; Y) for j = i+1, ..., 15. In order to compute the joint MIs, the MI and the conditional MI are evaluated O(n) and O(n^2) times, respectively. The maximum joint MI is I(X_{i*}, X_{j*}; Y). Generally, we only know that I(X_{i_1}, X_{j_1}; Y) ≤ I(X_{i*}, X_{j*}; Y). But in this particular application the equality holds, which suggests that sometimes we can use an efficient algorithm with only linear complexity to find the optimal coordinate axis view (X_{i*}, X_{j*}).
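In code, the agreement between the linear-cost search and the exhaustive search can be checked directly. This is a hypothetical usage sketch built on the functions above; X_radar and y_radar are synthetic placeholders, not the authors' radar data.

```python
import numpy as np

# Placeholder data standing in for the 15-dimensional radar pulse patterns:
# N patterns, 3 classes (the real data set is not available here).
rng = np.random.default_rng(0)
X_radar = rng.normal(size=(2000, 15))
y_radar = rng.integers(1, 4, size=2000)

i1, j1 = best_pair_greedy(X_radar, y_radar)               # O(n) search
i_star, j_star = best_pair_exhaustive(X_radar, y_radar)   # O(n^2) search
print("greedy:", (i1, j1), "exhaustive:", (i_star, j_star),
      "agree:", {i1, j1} == {i_star, j_star})
```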

The joint MI also gives other good sets of coordinate axis views with high joint MI values.

Figure 3: (a) Data visualization by two principal components; the spatial relation between patterns is not clear. (b) Using the optimal coordinate axis view (X_{i*}, X_{j*}) found via the joint MI to project the radar pulse data; the patterns are well spread, giving a better view of the spatial relation between patterns and the boundary between classes. (c) The CCA method. (d) The ICA method.

Each bar in Figure 2(b) is associated with a pair of inputs. Those pairs with high joint MI give good coordinate axis views for data visualization. Figure 3 shows that the data visualizations by the maximum JMI and by ICA are better than those by PCA and CCA because the data is nongaussian.
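A comparison in the spirit of Figure 3 can be sketched with off-the-shelf tools (our illustration under stated assumptions: scikit-learn's FastICA stands in for, but is not identical to, the ICA algorithms of [1, 8]; the CCA view is omitted; X and y are again placeholder arrays; matplotlib is assumed for plotting).

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA

def compare_2d_views(X, y, bins=10):
    """Scatter the data in the max-JMI coordinate view, a PCA view and an ICA view."""
    i1, j1 = best_pair_greedy(X, y, bins)  # supervised coordinate-axis view
    views = {
        "JMI axes (X%d, X%d)" % (i1 + 1, j1 + 1): X[:, [i1, j1]],
        "PCA (two principal components)": PCA(n_components=2).fit_transform(X),
        "ICA (two independent components)": FastICA(n_components=2).fit_transform(X),
    }
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (title, Z) in zip(axes, views.items()):
        ax.scatter(Z[:, 0], Z[:, 1], c=y, s=8)
        ax.set_title(title)
    plt.show()
```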

4.2 Radar pulse classification

Now we train a two-layer feed-forward network to classify the radar pulse patterns. Figure 3 shows that it is very difficult to separate the patterns by using just two inputs. We shall use either all inputs or four selected inputs. The data set D is divided into a training set D_1 and a test set D_2 consisting of a held-out percentage of the patterns in D. The network trained on the data set D_1 using all input variables is denoted by

    Y = f(X_1, \ldots, X_n; W_1, W_2, \theta),

where W_1 and W_2 are weight matrices and \theta is a vector of thresholds for the hidden layer.

From the data set D_1, we estimate the mutual information I(X_i; Y) and select i_1 = arg max_i I(X_i; Y). Given X_{i_1}, we estimate the conditional mutual information I(X_j; Y | X_{i_1}) for j ≠ i_1, and choose the three inputs X_{i_2}, X_{i_3} and X_{i_4} with the largest conditional MI. We found the quartet (i_1, i_2, i_3, i_4) = (1, 2, 3, 9). The two-layer feed-forward network trained on D_1 with the four selected inputs is denoted by

    Y = g(X_1, X_2, X_3, X_9; W_1', W_2', \theta').

There are 1365 choices for selecting 4 input variables out of 15. To set a reference performance for networks with four inputs, we choose quartets from the set Q = {(j_1, j_2, j_3, j_4) : 1 ≤ j_1 < j_2 < j_3 < j_4 ≤ 15}. For each chosen quartet (j_1, j_2, j_3, j_4), a two-layer feed-forward network is trained using the inputs (X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4}). These networks are denoted by Y = h_i(X_{j_1}, X_{j_2}, X_{j_3}, X_{j_4}; W_1'', W_2'', \theta''), with one network per quartet.

Figure 4: (a) The error rates of the network with the four inputs (X_1, X_2, X_3, X_9) selected by the joint MI are well below the average error rates (with error bars attached) of the networks with randomly selected input quartets; this shows that the input quartet (X_1, X_2, X_3, X_9) is rare but informative. (b) The network with the inputs (X_1, X_2, X_3, X_9) converges faster than the network with all inputs. The former uses 65% fewer parameters (weights and thresholds) and 73% fewer inputs than the latter. The classifier with the four best inputs is less expensive to construct and use, in terms of data acquisition costs, training time, and computing costs for real-time application.

The mean and the variance of the error rates of these networks are then computed. All networks have seven hidden units. The training and testing error rates of the networks at each epoch are shown in Figure 4, where we see that the network with the four inputs selected by the joint MI performs better than the networks with randomly selected input quartets and converges faster than the network with all inputs. The network with fewer inputs is not only faster in computation but also less expensive in data collection.
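The selection-plus-classification pipeline can be sketched as follows (our reconstruction under stated assumptions, not the authors' code): scikit-learn's MLPClassifier with seven hidden units stands in for the paper's two-layer feed-forward network, the 25% test split is a placeholder since the paper's exact split percentage is not legible in this transcription, and the histogram MI estimators defined earlier are reused.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def select_inputs_by_jmi(X, y, k=4, bins=10):
    """Pick i1 = argmax I(Xi;Y), then the k-1 inputs with the largest I(Xj;Y|Xi1)."""
    n = X.shape[1]
    i1 = max(range(n), key=lambda i: mutual_info(X[:, i], y, bins))
    rest = sorted((j for j in range(n) if j != i1),
                  key=lambda j: conditional_mutual_info(X[:, j], y, X[:, i1], bins),
                  reverse=True)
    return [i1] + rest[:k - 1]

def compare_classifiers(X, y):
    """Test accuracy of a network on all inputs vs. a network on the selected inputs."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    selected = select_inputs_by_jmi(X_tr, y_tr)
    full = MLPClassifier(hidden_layer_sizes=(7,), max_iter=2000).fit(X_tr, y_tr)
    small = MLPClassifier(hidden_layer_sizes=(7,), max_iter=2000).fit(X_tr[:, selected], y_tr)
    return full.score(X_te, y_te), small.score(X_te[:, selected], y_te), selected
```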

5 CONCLUSIONS

We have proposed data visualization and feature selection methods based on the joint mutual information and ICA. The maximum JMI method can find many good 2-D projections for visualizing high dimensional data which cannot be easily found by other existing methods. Both the maximum JMI method and the ICA method are very effective for visualizing nongaussian data. The variable selection method based on the JMI is found to be better at eliminating redundancy in the inputs than other methods based on simple mutual information.

Input selection methods based on mutual information (MI) have been useful in many applications, but they have two disadvantages. First, they cannot distinguish inputs when all of them have the same MI. Second, they cannot eliminate the redundancy in the inputs when one input is a function of other inputs. In contrast, our new input selection method based on the joint MI offers significant advantages in these two respects.

We have successfully applied these methods to visualize radar patterns and to select inputs for a neural network classifier to recognize radar pulses. We found a smaller yet more robust neural network for radar signal analysis using the JMI.

Acknowledgement: This research was supported by ONR grant N00014-96-1-0476.

References

[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8, eds. David S. Touretzky, Michael C. Mozer and Michael E. Hasselmo, MIT Press: Cambridge, MA, pages 757-763, 1996.

[2] G. Barrows and J. Sciortino. A mutual information measure for feature selection with application to pulse classification. In IEEE International Symposium on Time-Frequency and Time-Scale Analysis, 1996.

[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. on Neural Networks, 5(4):537-550, July 1994.

[4] B. Bonnlander. Nonparametric selection of input variables for connectionist learning. PhD thesis, University of Colorado, 1996.

[5] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.

[6] J. Moody. Prediction risk and architecture selection for neural networks. In V. Cherkassky, J. H. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks: Theory and Pattern Recognition Applications. NATO ASI Series F, Springer-Verlag, 1994.

[7] H. Pi and C. Peterson. Finding the embedding dimension and variable dependencies in time series. Neural Computation, 6:509-520, 1994.

[8] H. H. Yang and S. Amari. Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(7):1457-1482, 1997.