Sparse Representation using Nonnegative Curds and Whey


Yanan Liu, Fei Wu, Zhihua Zhang, Yueting Zhuang
College of Computer Science and Technology, Zhejiang University, China
{liuyn, wufei, zhzhang,

Shuicheng Yan
Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Abstract

It has been of great interest to find sparse and/or nonnegative representations in the computer vision literature. In this paper we propose a novel method for this purpose and refer to it as nonnegative curds and whey (NNCW). The NNCW procedure consists of two stages. In the first stage we compute a set of sparse and nonnegative representations of a test image, each a linear combination of the images within a certain class, by solving a set of regression-type nonnegative matrix factorization problems. In the second stage we merge these representations into a new sparse and nonnegative representation using the group nonnegative garrote. This procedure is particularly appropriate for discriminant analysis owing to the supervised and nonnegative nature of its sparsity pursuit. Experiments on several benchmark face databases and the Caltech 101 image dataset demonstrate the efficiency and effectiveness of our nonnegative curds and whey method.

1. Introduction

The problem of finding a sparse representation of data has recently become an important topic in computer vision and pattern recognition. The essential challenge in sparse representation is to develop an efficient approach with which each sample can be reconstructed from its sparse representation. Nonnegative matrix factorization (NMF) [12, 13] is an important technique for finding such representations, and it is well established that NMF produces sparse representations in a collective way [10, 11]. Moreover, the nonnegativity constraint makes the representation easy to interpret, since it is a purely additive combination of nonnegative basis vectors. NMF has been successfully applied in computer vision and pattern recognition, especially for image analysis [12]. Many of these applications are, however, in an unsupervised setting and hence ignore the correlations within a class and the disparities between classes. Under a supervised setting, NMF can be regarded as a nonnegative garrote [2].

Figure 1. An exemplar illustration of sparse representation using nonnegative curds and whey.

In this paper we consider a supervised setting for image representation as well as image classification. Given the empirically validated discriminative power of sparse representations in classification, a test image outside the training set can ideally be represented in terms of the training images alone, with nonzero coefficients only on the samples belonging to the same class as the test image. That is, a valid test image can be sufficiently represented by the training samples of its own class. Sparse representation can expedite classification when the number of classes is reasonably large: the sparser the coefficients, the more easily the test sample is assigned its correct class label. Therefore, when the test image is expressed as a linear superposition of all the training images, the coefficient vector is expected to be sparse and nonnegative.
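As a toy illustration of this premise (synthetic data, not the method developed below), the following Python sketch reconstructs a nonnegative test vector from nonnegative training columns with SciPy's nonnegative least squares; even without an explicit sparsity penalty, the nonnegativity constraint alone already drives most coefficients to zero.

```python
# Toy illustration: a nonnegative test vector reconstructed from
# nonnegative "training" columns; nonnegativity alone yields sparsity.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
d, n = 64, 40                            # feature dimension, training set size
X = rng.random((d, n))                   # columns play the role of training images
y = X[:, [3, 7]] @ np.array([0.6, 0.4])  # test vector built from two columns

coef, residual = nnls(X, y)              # min ||X c - y||_2  s.t.  c >= 0
print("nonzero coefficients:", np.flatnonzero(coef > 1e-8))
print("reconstruction residual:", residual)
```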

In particular, we model the sparse representation with a nonnegative curds and whey (NNCW) method. The key idea is to exploit the similarity within each class and the disparity between classes by formulating the classification problem as two consecutive linear regressions. That is, a test image is represented as a nonnegative weighted combination of all the training images, with two sets of sparse nonnegative weights: one over the training images within each class, and one over the classes.

Our work is motivated by the recent work of Wright et al. [21], which casts face recognition as a linear regression problem with sparsity constraints on the regression coefficients. To solve this problem, Wright et al. [21] reformulated it as the lasso problem [19]. Lasso-based sparse representation has also been used for image annotation with multiple tags [20], classification [7], and clustering [6]. For example, Wang et al. [20] proposed a multi-label sparse coding framework for automatic image annotation that exploits $\ell_1$-norm based reconstruction coefficients. In [7], an empirical Bayesian approach to sparse regression and classification is presented that involves no parameters controlling the degree of sparseness. Elhamifar and Vidal [6] introduced a sparse representation-based method to cluster data drawn from multiple low-dimensional subspaces embedded in a high-dimensional space.

However, the lasso does not force the representation to be additive, so the representation may not be as interpretable as that of NMF. Moreover, the class label (discriminant) information in the training set is not explicitly used when constructing the sparse representation, which may limit the ultimate classification accuracy. Our proposed method circumvents these limitations, since its two linear regression steps both exploit the discriminative class information and impose a nonnegativity constraint on every coefficient.

Beyond the image classification problem at hand, nonnegative curds and whey is also related to the group nonnegative garrote [22], a grouped extension of the conventional nonnegative garrote. In the group nonnegative garrote the regression coefficients of individual variables are estimated by least squares, so these coefficients are neither guaranteed to be nonnegative nor sparse.

Figure 1 illustrates the overall procedure of the proposed NNCW method. Intuitively, lasso-based representation methods learn a single regression model, without using any discriminative label information or imposing nonnegativity constraints during the convex optimization. NNCW, in contrast, first obtains m independent representations (called curds), one from each class, and then uses the curds to define a new representation (called whey). The two regression models are constructed in sequence, and the latter step directly outputs the class label.

The rest of this paper is organized as follows. Section 2 details the nonnegative curds and whey (NNCW) method for image representation and classification. Section 3 reviews related work. Experimental results are reported in Section 4. Finally, we conclude in Section 5.

2. Methodology

We are given a set of n training samples $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, where each $x_i$ is a d-dimensional feature vector representing an image.
The images are assumed to be grouped into m disjoint classes, with each $x_i$ belonging to one and only one class. Let $n_j$ be the cardinality of the jth class, so that $\sum_{j=1}^{m} n_j = n$. Without loss of generality, we collect the samples of the jth class into a $d \times n_j$ matrix $X_j$ and accordingly form the $d \times n$ training data matrix $X = [X_1, \ldots, X_m]$. Our concern is to train a classifier such that, once the sparse representation of a test image $y \in \mathbb{R}^d$ has been constructed from the training data, we can predict its class label. The basic idea is to devise a sparse representation approach for the development of classifiers.

Before formally presenting our method, we fix some notation. For a p-vector $a = (a_1, \ldots, a_p)^T$, we denote by $\|a\|_2$ its $\ell_2$-norm (i.e., $\|a\|_2 = \sqrt{\sum_{j=1}^{p} a_j^2}$), by $\|a\|_1$ its $\ell_1$-norm (i.e., $\|a\|_1 = \sum_{j=1}^{p} |a_j|$), and by $\|a\|_0$ its $\ell_0$-norm (the number of nonzero entries of a).

The sparse representation-based classification approach proposed in this paper learns two inseparable linear regression models with nonnegative coefficient constraints under a supervised learning framework. The central aim is to use discriminative information to make the classifier interpretable (and hence structured) and additive.

2.1. Nonnegative Curds and Whey Procedure

Our proposed sparse representation approach consists of two stages. In the first stage we consider m linear regression models, treating y as the response and each image in $X_j$ as a basis vector. That is, the jth regression problem is based on
$$y = X_j b_j + \epsilon_j, \qquad (1)$$
where $\epsilon_j$ is an error term and $b_j = (b_{j,1}, \ldots, b_{j,n_j})^T \in \mathbb{R}^{n_j}$ for $j = 1, \ldots, m$. Recall that y and the $x_i$ represent test and training images, respectively, so they are typically encoded with nonnegative values. The idea behind nonnegative matrix factorization for learning parts-based representations [12] inspires us to impose nonnegativity on the regression vectors $b_j$.

As a result, we obtain the following optimization problems: for $j = 1, \ldots, m$,
$$\min_{b_j} \; \frac{1}{2}\|y - X_j b_j\|_2^2 + \lambda_j \sum_{l=1}^{n_j} b_{jl} \quad \text{s.t.} \quad b_{jl} \ge 0, \; \forall l, \qquad (2)$$
where the $\lambda_j \ge 0$ are tunable weighting parameters. Each optimization problem in (2) is a nonnegative garrote model [2]. The nonnegative garrote can be solved efficiently by classical numerical methods such as least angle regression (LARS) [5] and the pathwise coordinate method [8]; however, we follow Breiman's original implementation [2], which shrinks each ordinary least squares (OLS) estimated coefficient by a nonnegative amount whose sum is subject to an upper-bound constraint (the garrote).

Let $\hat{b}_j = (\hat{b}_{j1}, \ldots, \hat{b}_{jn_j})^T$ be the estimate of $b_j$. Because the penalty bounds $\sum_{l=1}^{n_j} b_{jl}$, which equals the $\ell_1$-norm of the nonnegative $b_j$, the estimates $\hat{b}_j$ are sparse, and we obtain m sparse representations of y, namely $z_j = X_j \hat{b}_j$ for $j = 1, \ldots, m$. Since $\hat{b}_j$ contains the reconstruction coefficients learned from the samples of the jth class, the collection of all the $\hat{b}_j$ reflects the disparities among the classes. Therefore, to capture the class label information in the training samples and exploit these disparities, we consider the optimization problem
$$\min_{c_1, \ldots, c_m} \; \frac{1}{2}\Big\|y - \sum_{j=1}^{m} c_j z_j\Big\|_2^2 + \lambda \sum_{j=1}^{m} p_j c_j \quad \text{s.t.} \quad c_j \ge 0, \; \forall j, \qquad (3)$$
where $\lambda \ge 0$ is a tunable weighting parameter and the $p_j > 0$ are degrees of penalization. In all experiments we set $p_j = n_j / n$. The optimization problem in (3) is again a nonnegative garrote model, which we solve once more with Breiman's implementation [2].

The second stage thus refines the representation of y by using the representations obtained in the first stage. In particular, y is now represented as
$$u = \sum_{j=1}^{m} \hat{c}_j z_j = \sum_{j=1}^{m} \hat{c}_j X_j \hat{b}_j. \qquad (4)$$
Since $\sum_{j=1}^{m} p_j c_j$ can be regarded as a weighted $\ell_1$-norm of $c = [c_1, c_2, \ldots, c_m]^T$, some of the $\hat{c}_j$ are zero, so the representation is sparse. Moreover, if $\hat{c}_j = 0$ for some $j \in \{1, \ldots, m\}$, all samples of the jth class are eliminated from the representation because $\hat{c}_j \hat{b}_j = 0$, and the test image y evidently does not belong to the jth class.

Algorithm 1 NNCW (nonnegative curds and whey)
1: procedure NNCW($\{X_1, \ldots, X_m\} \subset \mathbb{R}^d$; $y \in \mathbb{R}^d$)
2:   Curds: solve the m optimization problems in (2), yielding $\hat{b}_j$ and $z_j = X_j \hat{b}_j$ for $j = 1, \ldots, m$.
3:   Whey: solve the optimization problem in (3), yielding $\hat{c}_1, \ldots, \hat{c}_m$.
4:   Output: the class label of the test sample y, $k = \arg\min_j \|y - \hat{c}_j X_j \hat{b}_j\|_2^2$.
5: end procedure

The optimization problems in (2) define m independent representations of y, which we call curds. The optimization problem in (3) then takes advantage of these m curds to define a new representation, which we call whey. Since we impose nonnegativity constraints on the $b_j$ as well as the $c_j$, we refer to our method for sparse image representation as the nonnegative curds and whey (NNCW) method.

2.2. Classification Procedure

Given the test sample y and the corresponding $\hat{b}_j$ and $\hat{c}_j$ for $j = 1, \ldots, m$ obtained from the NNCW method, we are now concerned with the class label of y. Ideally, the nonzero $\hat{c}_j$ indicates the class to which y belongs. However, there is not always exactly one nonzero $\hat{c}_j$. We therefore allocate y to the kth class with
$$k = \arg\min_j \|y - \hat{c}_j X_j \hat{b}_j\|_2^2. \qquad (5)$$
The entire NNCW method is summarized in Algorithm 1.
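As a minimal computational sketch of Algorithm 1, the NumPy code below substitutes a generic projected-gradient solver for Breiman's garrote implementation [2] used in the paper; both the curds step (2) and the whey step (3) are instances of $\min_b \frac{1}{2}\|y - Xb\|_2^2 + w^T b$ subject to $b \ge 0$. The function names and the step-size rule are our own choices.

```python
# A sketch of NNCW (Algorithm 1); the subproblem solver is a plain
# projected-gradient method, not Breiman's original garrote procedure.
import numpy as np

def nonneg_l1_regression(X, y, weights, n_iter=2000):
    """Solve min_b 0.5*||y - X b||^2 + weights^T b  s.t.  b >= 0."""
    b = np.zeros(X.shape[1])
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)   # 1/L with L = ||X||_2^2
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) + weights             # gradient of the objective
        b = np.maximum(b - step * grad, 0.0)           # project onto b >= 0
    return b

def nncw(class_mats, y, lams, lam):
    """class_mats: list of d x n_j matrices X_j. Returns (b_hats, c_hat)."""
    # Curds: one sparse nonnegative regression per class, as in (2).
    b_hats = [nonneg_l1_regression(Xj, y, lj * np.ones(Xj.shape[1]))
              for Xj, lj in zip(class_mats, lams)]
    Z = np.column_stack([Xj @ bj for Xj, bj in zip(class_mats, b_hats)])
    # Whey: weighted penalty with p_j = n_j / n, as in (3).
    n = sum(Xj.shape[1] for Xj in class_mats)
    p = np.array([Xj.shape[1] / n for Xj in class_mats])
    c_hat = nonneg_l1_regression(Z, y, lam * p)
    return b_hats, c_hat

def classify(class_mats, y, b_hats, c_hat):
    """Rule (5): assign y to the class minimizing ||y - c_j X_j b_j||_2^2."""
    errs = [np.linalg.norm(y - cj * (Xj @ bj)) ** 2
            for Xj, bj, cj in zip(class_mats, b_hats, c_hat)]
    return int(np.argmin(errs))
```

In practice the parameters $\lambda$ and $\lambda_j$ would be chosen by cross-validation, as done in the experiments of Section 4.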
3. Related Work

The so-called curds and whey method was first proposed by Breiman and Friedman [3] as a form of multivariate shrinkage. The main purpose of the method in [3], however, is to improve predictive accuracy in multiple linear regression by exploiting correlations between the response variables. As discussed above, NNCW instead first uses intra-class information to generate m independent representations (curds) and then uses inter-class information to build a linear regression model (whey) for discriminative learning.

To some extent, the proposed NNCW can be regarded as a variant of the group lasso [22]. The group lasso is a natural extension of the lasso in which the covariates are assumed to be clustered in groups. Intuitively, the group lasso drives all the weights in a group to zero together and thus leads to group selection. Unlike NNCW, which puts nonnegativity constraints on the coefficients of each training sample and of each class, the group lasso imposes no nonnegativity constraints.

More specifically, NNCW is closely related to the group nonnegative garrote [22]. The main difference is that the group nonnegative garrote instead uses $z_j = X_j b_j^{LS}$, where $b_j^{LS}$ is the least squares estimate. In this case $z_j$ is not guaranteed to be nonnegative even though $X_j$ is, so $z_j$ may no longer represent a real image. Moreover, owing to its explicit dependence on the full least squares estimates, the group nonnegative garrote may perform suboptimally when the sample size is small relative to the total number of variables. Consequently, the group nonnegative garrote is not robust to image noise and occlusion, so we do not compare our algorithm against it and instead focus on sparsity-related algorithms.

It is worth pointing out that the sparse representation in [21] aims to solve
$$\min_\beta \|\beta\|_0 \quad \text{subject to} \quad X\beta = y, \qquad (6)$$
where $\beta = (b_1^T, \ldots, b_m^T)^T$. This problem, however, is NP-hard [1]. Based on the sparsity theory of Donoho [4], Wright et al. [21] therefore consider the alternative
$$\min_\beta \; \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{m} \sum_{i=1}^{n_j} |b_{ji}|, \qquad (7)$$
which is essentially the lasso model. On one hand, this sparse model does not exploit the discriminative class information, which is clearly useful for classification; in many regression problems we are interested in identifying important explanatory factors for the categorical response, where each factor may be represented by a group of derived variables. On the other hand, since $\beta$ is not constrained to be nonnegative, such sparse representations lack the interpretability of NMF and NNCW.

The strength of this work is to integrate sparse coding, nonnegative data factorization, and supervised learning in the NNCW framework. The two jointly learned linear regression models encode the similarity and disparity information useful for classification, and the nonnegativity constraints together with the natural sparsity make NNCW more interpretable.
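For contrast with (7), the following sketch fits the lasso baseline with scikit-learn on synthetic placeholder data. Note that sklearn's Lasso minimizes $\frac{1}{2 n_{\text{samples}}}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so $\alpha$ corresponds to $\lambda$ only up to scaling, and the fitted coefficients are generally signed, which is exactly the interpretability issue raised above.

```python
# Lasso baseline corresponding to (7), on synthetic data; coefficients
# can be negative, unlike NNCW's nonnegative representation.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((64, 40))    # columns: training images (placeholder data)
y = rng.random(64)          # test image (placeholder data)

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
beta = lasso.fit(X, y).coef_
print("negative coefficients:", int(np.sum(beta < 0)))

# Setting positive=True in Lasso gives the nonnegative variant, which
# recovers the additivity/interpretability argument made for NNCW.
```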
4. Experiments

In this section we investigate applications of our nonnegative curds and whey (NNCW) method in face recognition and image classification. We compare NNCW with three popular classification methods, nearest neighbor (NN), naive Bayes (NB), and linear support vector machines (SVM), as well as with the sparse-representation based classification (SRC) proposed by Wright et al. [21] and the group lasso (glasso) [22]. The tuning parameters $\lambda$ and $\lambda_j$ ($j = 1, \ldots, m$) are selected by 10-fold cross-validation to avoid over-fitting.

Four face databases and one image dataset were used. The face databases are the ORL database [17], the Extended Yale B database [9], the AR face database [16], and the CMU PIE face database [18]; these focus on frontal faces, illumination variations, occlusions, and pose variations, respectively. We also conducted experiments on the Caltech 101 image dataset [15].

4.1. Visualization on Face Dataset

We first give a visual comparison of nonnegative curds and whey (NNCW) with lasso-based sparse representation (LSR) and group lasso (glasso) on a subset of the ORL dataset. We chose 10 persons from the ORL database and 9 images per subject as training data; the one remaining image per person is treated as the test sample. Figures 2, 3, and 4 show the sparse representations obtained by LSR, glasso, and NNCW, respectively, for a test sample from the first individual. Figure 4(a) shows the first stage of NNCW, which computes $z_j$, $1 \le j \le 10$, and Figure 4(b) illustrates the second stage, which refines the sparse representation of y from the first stage. By incorporating the class label information in the second stage of NNCW, the optimized sparse estimates of the class weights $c_j$ ($1 \le j \le 10$) lead to very sparse coefficients for the test sample y. Specifically, NNCW yields an optimized $c_1$ (i.e., 0.86) and $b_1$ (i.e., [0, 0, 0, 0, 0.58, 0, 0, 0.0, 0]) to estimate y; these two sets of estimated parameters reconstruct y effectively. In Figure 2, except for the first class, to which the test sample belongs, the class weights computed by LSR are not sparse enough, which is why its reconstruction of y is not as good as NNCW's. Although glasso selects the correct class in Figure 3, NNCW's reconstruction is much better than glasso's; a possible explanation is that the negative coefficients of glasso visually degrade the representation of the images.

4.2. Recognition on frontal faces

The ORL database consists of 400 face images of 40 people (10 samples per person). The 92 × 112 images were captured at different times and exhibit variations in expression (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses).

Figure 2. Visualization of LSR on the sampled ORL dataset. The test image y belongs to the first subject.

Figure 3. Visualization of glasso on the sampled ORL dataset. The test image y belongs to the first subject.

To compute the recognition rates, the images are downsampled to 48-, 99-, 220-, and 644-dimensional feature vectors. For each subject, 6 images are randomly selected for training and the rest are used for testing.

Figure 5. Comparison of face recognition accuracy rates on the ORL database.

Figure 5 shows the face recognition results on the ORL database for the six classification methods. All six algorithms perform well, since the ORL database contains almost exclusively frontal faces with little pose or illumination variation. The proposed NNCW achieves the best recognition accuracy rate of 96.53%, compared with 96.04% for glasso, 95.47% for SRC, 93.64% for SVM, 94.38% for NB, and 93.44% for NN.

4.3. Recognition with illumination variations

The Extended Yale B database consists of 2414 frontal-face images of 38 persons. The cropped face images were captured under various illumination conditions [14]. The illumination type is determined uniquely by the azimuth and elevation of the light source, where the azimuth ranges from -130° to 130° and the elevation from -40° to 90°. First, we randomly select half of the images (about 32 per individual) for training and use the other half for testing, as in [21]. The images are downsampled to 30, 56, 120, and 504 feature dimensions, corresponding to downsampling ratios of 1/32, 1/24, 1/16, and 1/8, respectively. From Figure 6 we can see that NNCW improves the best recognition accuracy rate to 95.29%, compared with 80.63% for NN, 87.25% for NB, 85.54% for SVM, 94.3% for SRC, and 94.58% for glasso.
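As a protocol note, the sketch below shows one way (our reading; the exact interpolation used in the paper is unspecified) to produce the downsampled feature vectors used throughout these experiments. For instance, a 92 × 112 ORL image at ratio 1/4 becomes a 23 × 28 = 644-dimensional vector.

```python
# Downsample an image by a given per-axis ratio and flatten it into a
# feature vector; the interpolation method here is an assumption.
import numpy as np
from PIL import Image

def downsample_features(img_path, ratio):
    img = Image.open(img_path).convert("L")   # load as grayscale
    w, h = img.size
    small = img.resize((max(1, round(w * ratio)), max(1, round(h * ratio))))
    return np.asarray(small, dtype=float).ravel()

# e.g. downsample_features("orl_s1_1.pgm", 1/4) -> 644-dimensional vector
# (the file name is hypothetical).
```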

Figure 4. Visualization of NNCW on the sampled ORL dataset: (a) the first stage of NNCW; (b) the second stage of NNCW. The test image y belongs to the first subject.

Figure 6. Comparison of face recognition accuracy rates on the Extended Yale B face database.

Second, we divide the images with 504 feature dimensions into five subsets of increasing azimuthal lighting angle: the frontally illuminated images are used as the training set, and test sets 1, 2, 3, and 4 contain face images under illumination conditions whose lighting angle varies from 5° to 15°, from 20° to 45°, from 50° to 75°, and from 75° to 130°, respectively. Figure 7 shows the face recognition accuracy rates under the different illumination conditions. The recognition accuracy rates decline as the lighting angle increases, which indicates that illumination variations hurt face recognition performance, especially for the nearest neighbor (NN) classifier. Our NNCW achieves recognition accuracy rates between 87.22% and 98.75%, much better than the other methods.

Figure 7. Comparison of face recognition accuracy rates under different illuminations on the Extended Yale B database.

4.4. Recognition with occlusions

The AR face database comprises over 4,000 color images of the faces of 126 people (70 male and 56 female). The dataset includes frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). Each person participated in two sessions separated by two weeks (14 days), for a total of 26 pictures per person. In this experiment, as in [21], we first chose a subset of the database consisting of 50 male and 50 female individuals. For each individual, the 13 images from Session 1 were used for training and the 13 images from Session 2 for testing. The images are first cropped and converted into grayscale, and then downsampled to 30-, 54-, 130-, and 540-dimensional feature spaces, with downsampling

ratios of 1/24, 1/18, 1/12, and 1/6, respectively. Figure 8 shows the recognition accuracy rates for this experiment. NNCW achieves a recognition accuracy rate of 90.5% with 540-dimensional features, higher than NN and the other methods, e.g., 83.32% for NB, 84.8% for SVM, 88.33% for SRC, and 88.98% for glasso.

Figure 8. Comparison of face recognition accuracy rates on the AR database.

Moreover, we test the classification performance under different occlusions on a subset of the AR face database (70 male and 55 female individuals; women-027 is excluded because of a corrupted image, w bmp). We use 1750 unoccluded frontal face images (14 per individual) as the training set. Test set 1, with sunglasses occlusion, contains 750 images (6 per individual), and test set 2, with scarf occlusion, also consists of 750 images (6 per individual). Table 1 lists the face recognition accuracy rates under both occlusion types for the six methods. NNCW achieves the best recognition accuracy rates under both occlusion conditions, although with scarf occlusion the overall accuracy rate is not very high.

        Sunglasses   Scarves
NN      69.87%       41.2%
NB      75.39%       40.66%
SVM     81.33%       45.48%
SRC     86.28%       59.2%
glasso  86.93%       61.37%
NNCW    88.44%       62.9%

Table 1. Comparison of face recognition accuracy rates with different occlusions (sunglasses and scarves) on the AR database.

4.5. Recognition under different poses

In this experiment we evaluate the six methods under different poses using the CMU PIE face database. We use the frontal faces (c27) as training data and four test sets with increasing pose angle: test set 1 (c05, c29), test set 2 (c11, c37), test set 3 (c02, c14), and test set 4 (c22, c34).

Figure 9. Comparison of face recognition accuracy rates on the CMU PIE database.

Figure 9 shows the face recognition results of the six methods. Test set 1 contains the most nearly frontal images, so its recognition accuracy rates are the best. The results on test sets 2 and 3 are worse because the pose variations are larger than those for test set 1. Test set 4 consists almost entirely of profile faces, which yields the worst recognition accuracy rates. The proposed NNCW method still achieves better performance than the others.

4.6. Image classification

The Caltech 101 image database contains 9,144 images from 101 object categories, collected from Google image search by Li et al. [15]. Most objects are centered and in the foreground. To make the comparison robust, we selected the 50 categories containing the most sample images, ranging from 60 to 800 images per category. We then randomly chose 50 images per category for training and used the remaining images for testing. The images were downsampled to 100 (10 × 10), 225 (15 × 15), 400 (20 × 20), and 625 (25 × 25)-dimensional feature vectors. From Figure 10, we observe that NNCW outperforms the other methods for image classification on the Caltech 101 image database. The classification accuracies on Caltech 101 are lower than those on the face databases, since general image classification is more complicated and challenging.

4.7. Exploring sparseness

To verify the sparsity of the proposed NNCW method, we also investigate the sparsity ratio of the estimated $\hat{b}$, defined as
$$\text{Sparsity ratio} = \frac{\text{number of zeros in } \hat{b}}{\text{number of elements in } \hat{b}}.$$
Table 2 lists the average sparsity ratio of the estimated coefficients $\hat{b}$ for the different databases with SRC, glasso, and NNCW. As can be seen, all the sparsity ratios are larger than 0.5 for SRC and larger than 0.6 for glasso and NNCW. This indicates that the sparse representation indeed takes effect in the classification task and that NNCW generally achieves better sparsity than SRC and glasso.
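The sparsity ratio just defined is straightforward to compute; a minimal helper (the tolerance for treating an entry as zero is our choice) is:

```python
# Fraction of (numerically) zero entries in an estimated coefficient vector.
import numpy as np

def sparsity_ratio(b_hat, tol=1e-8):
    b_hat = np.asarray(b_hat, dtype=float)
    return np.count_nonzero(np.abs(b_hat) <= tol) / b_hat.size

print(sparsity_ratio([0.0, 0.58, 0.0, 0.01, 0.0]))   # -> 0.6
```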

Figure 10. Comparison of image classification accuracies on the Caltech 101 image database.

Table 2. Comparison of average sparsity ratio between SRC, glasso and NNCW on the ORL, Extended Yale B, AR, PIE, and Caltech 101 databases.

5. Conclusions and Future Work

This paper proposed a novel sparse nonnegative image representation method, called the nonnegative curds and whey (NNCW). The NNCW method is attractive for its natural sparsity along with its nonnegativity property and discriminating capability. It consists of a set of nonnegative garrote models, which are solved using the numerical approach developed by Breiman [2]. In recent years, more sophisticated approaches to the nonnegative garrote have appeared, such as least angle regression [5] and the pathwise coordinate method [8]; implementing our method via these approaches is an interesting direction for future work.

Acknowledgements

This work is supported by the 973 Program (2009CB32080), the National Natural Science Foundation of China, the National Key Technology R&D Program (2007BAHB0), and the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT).

References

[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 1998.
[2] L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 1995.
[3] L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression (with discussion). J. R. Statist. Soc. B, 1997.
[4] D. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Comm. Pure Appl. Math., 2006.
[5] B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani. Least angle regression. Ann. Statist., 2004.
[6] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[7] M. Figueiredo. Adaptive sparseness for supervised learning. IEEE TPAMI, 2003.
[8] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 2007.
[9] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, 2001.
[10] P. Hoyer. Nonnegative matrix factorization with sparseness constraints. JMLR, 2004.
[11] J. Kim and H. Park. Sparse nonnegative matrix factorization for clustering. Technical Report GT-CSE-08-01, Georgia Institute of Technology, 2008.
[12] D. Lee and H. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 1999.
[13] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2001.
[14] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE TPAMI, 2005.
[15] F. Li, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
[16] A. Martinez and R. Benavente. The AR face database. Technical Report 24, CVC, 1998.
[17] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In 2nd IEEE Workshop on Applications of Computer Vision, 1994.
[18] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE TPAMI, 2003.
[19] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 1996.
[20] C. Wang, S. Yan, L. Zhang, and H. J. Zhang. Multi-label sparse coding for automatic image annotation. In CVPR, 2009.
[21] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE TPAMI, 2009.
[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 2006.


More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

What is the Right Illumination Normalization for Face Recognition?

What is the Right Illumination Normalization for Face Recognition? What is the Right Illumination Normalization for Face Recognition? Aishat Mahmoud Dan-ali Department of Computer Science and Engineering The American University in Cairo AUC Avenue, P.O. Box 74, New Cairo

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Sparse Nonnegative Matrix Factorization for Clustering

Sparse Nonnegative Matrix Factorization for Clustering Sparse Nonnegative Matrix Factorization for Clustering Jingu Kim and Haesun Park College of Computing Georgia Institute of Technology 266 Ferst Drive, Atlanta, GA 30332, USA {jingu, hpark}@cc.gatech.edu

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Big Data Techniques Applied to Very Short-term Wind Power Forecasting

Big Data Techniques Applied to Very Short-term Wind Power Forecasting Big Data Techniques Applied to Very Short-term Wind Power Forecasting Ricardo Bessa Senior Researcher (ricardo.j.bessa@inesctec.pt) Center for Power and Energy Systems, INESC TEC, Portugal Joint work with

More information

An Initial Study on High-Dimensional Data Visualization Through Subspace Clustering

An Initial Study on High-Dimensional Data Visualization Through Subspace Clustering An Initial Study on High-Dimensional Data Visualization Through Subspace Clustering A. Barbosa, F. Sadlo and L. G. Nonato ICMC Universidade de São Paulo, São Carlos, Brazil IWR Heidelberg University, Heidelberg,

More information

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform

More information

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore Steven C.H. Hoi School of Computer Engineering Nanyang Technological University Singapore Acknowledgments: Peilin Zhao, Jialei Wang, Hao Xia, Jing Lu, Rong Jin, Pengcheng Wu, Dayong Wang, etc. 2 Agenda

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

More information