
BLOCK, GROUP, AND AFFINE REGULARIZED SPARSE CODING AND DICTIONARY LEARNING

By

YU-TSEH CHI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Yu-Tseh Chi

To my father, my wife and my son

ACKNOWLEDGMENTS

I am extremely grateful to Dr. Jeffrey Ho for his guidance and support during my graduate studies. He has been a constant source of inspiration and encouragement for me, the most important ingredients necessary for research. I am also thankful to Dr. Jorg Peters, Dr. Arunava Banerjee, Dr. Baba C. Vemuri and Dr. Thomas Burks for being on my supervisory committee and providing extremely useful insights into the work presented in this dissertation. I would like to thank the Department of Computer and Information Science and Engineering (CISE) and the University of Florida (UF) for giving me the opportunity to pursue my graduate studies in a very constructive environment. I am especially thankful to the CISE department for funding my doctoral studies and travels to various conferences. During my graduate studies, I enjoyed my job as a teaching assistant, and for that I am grateful to Dr. Manuel Bermudez and Dr. Jonathan Liu for being terrific bosses. I also appreciate the camaraderie of my lab-mates Mohsen Ali, Shahed Nejhum, Muhammad Rushdi, Shaoyu Qi, Karthik Gurumothy, Venkatkrishinan Ramaswamy, Nicholas Fisher, Nathan Van Der Kraat, Subhajit Sengupta, Ajit Rajwade, Terry Ching-Hsiang Hsu and Hang Yu. It was a fun lab to work in. I am thankful to my long-time friends in Taiwan, MC Hsiao, Ahway Chen, Yan-Fu Kao, Yan-Sue Chiang, Odie Yu, Richard Hsu and Chin-Young Hsu, for rooting for me. Lastly and most importantly, I am thankful to my family for their unconditional and unflinching love and support. I will be eternally grateful to my father, Chi No, and especially my wife, Aparna Gazula, for all that they have done for me.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 OVERVIEW

2 AFFINE CONSTRAINED GROUP SPARSE CODING FOR IMAGE CLASSIFICATIONS
   Introduction
   Theory and Method
      Theoretical Guarantee
      Part-Based ACGSC
   Related Works
   Experiments
      Face Classification
      Imposter Detection
      Face Recognition with Occlusions
      Texture Classification
   Future Work

3 BLOCK AND GROUP REGULARIZED SPARSE MODELING FOR DICTIONARY LEARNING
   Introduction
   Related Work
   Methods
      Theoretical Guarantee
      Block/Group Sparse Coding
      Reconstructed Block/Group Sparse Coding
      Intra-Block Coherence Suppression Dictionary Learning
   Experiments and Discussions
      Hand-Written Digit Recognition
      Group Regularized Face Classification
      Unsupervised Texture Clustering

4 CONCLUSION

APPENDIX: PROOF OF THE THEORETICAL GUARANTEE

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Classification error (%) with different structure on D
3-2 Classification error (%) with different β in ICS-DL
3-3 Comparison of classification error rate (%) between our approaches and others

LIST OF FIGURES

2-1 Illustration of the proposed Affine Constrained GSC framework
2-2 Illustration of the standard GSC and our proposed ACGSC
2-3 Comparison between the standard and part-based ACGSC
2-4 Selected training and test samples from the Yale Extended B database
2-5 Experiment results of our ACGSC and other methods on face recognition
2-6 Reconstructed images and test samples from the face recognition experiment
2-7 Precision vs. recall curve of the imposter detection experiment
2-8 Examples of detected imposters
2-9 Results of the part-based ACGSC on face recognition with occlusion
2-10 Results of applying part-based ACGSC on face recognition with occlusion
2-11 Selected images from the cropped Curet database
2-12 Classification rates of the texture classification experiment
3-1 Illustration of the proposed Block/Group Sparse Coding algorithm
3-2 Visualization of the sparse coefficients of the training samples
3-3 Intra-block coherence of the dictionary and error rates
3-4 Error rates (%) of the USPS dataset under five different scenarios
3-5 Demonstration of selected training and test samples
3-6 Results of the texture separation experiment
A-1 The equivalence between X = DC and x = Dc
A-2 The equivalence between D_[i], C_[i] and D_[i], c_[i]

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

BLOCK, GROUP, AND AFFINE REGULARIZED SPARSE CODING AND DICTIONARY LEARNING

By Yu-Tseh Chi

August 2013

Chair: Jeffrey Ho
Major: Computer Engineering

In this dissertation, I first propose a novel approach for sparse coding that further improves upon the sparse representation-based classification (SRC) framework. This proposed framework, affine constrained group sparse coding, extends the current SRC framework to classification problems with multiple inputs. Geometrically, the constrained group sparse coding essentially searches for the vector in the convex hull spanned by the input vectors that can best be sparse coded using the given dictionary. The resulting objective function is still convex and can be efficiently optimized using an iterative block-coordinate descent scheme that is guaranteed to converge. Furthermore, I provide a form of sparse recovery result that guarantees, at least theoretically, that the classification performance of the constrained group sparse coding should be at least as good as that of group sparse coding.

While utilizing the affine combination of multiple input test samples can improve the performance of the conventional sparse representation-based classification framework, it is difficult to integrate this approach into a dictionary learning framework. Therefore, we propose to combine (1) imposing group structure on the data, (2) imposing block structure on the dictionary, and (3) using different regularizer terms to sparsely encode the data. We call the resulting approaches block/group sparse coding (BGSC) and reconstructed block/group sparse coding (R-BGSC). Incorporating either one of them with the novel Intra-block Coherence Suppression Dictionary Learning (ICS-DL) algorithm, which, as the name suggests, suppresses the

coherence of atoms within the same block, results in a novel dictionary learning framework. An important and distinguishing feature of the proposed framework is that all dictionary blocks are trained simultaneously with respect to each data group, while the intra-block coherence is explicitly minimized as an important objective. We provide both empirical evidence and heuristic support for this feature, which can be considered a direct consequence of incorporating both the group structure for the input data and the block structure for the dictionary in the learning process. The optimization problems for both the dictionary learning and the sparse coding can be solved efficiently using block-coordinate descent, and the details of the optimization algorithms are presented. In both parts of this work, the proposed methods are evaluated on several classification (supervised) and clustering (unsupervised) problems using well-known datasets. Favorable comparisons with state-of-the-art methods demonstrate the viability and validity of the proposed frameworks.

CHAPTER 1
OVERVIEW

In recent years, sparse signal representation has received a lot of attention. It has proven to be a powerful tool in the fields of computer vision and machine learning [8, 45]. Its success is mainly due to the fact that images or image patches have naturally sparse representations with respect to global, pre-constructed bases (DCT, wavelets) or bases trained specifically for the data [26].

In contrast to eigen-coefficients, which are simply projections of the input feature vector onto the eigen-space, the calculation of the sparse representation is not straightforward. It involves minimizing

$$\min_{c} \|x - Dc\|_2^2, \quad \text{s.t. } \|c\|_0 \le s, \qquad (1\text{-}1)$$

where $D \in \mathbb{R}^{n \times k}$ is the dictionary, $x$ is the input vector and $c$ is the sparse representation of $x$. The above problem is combinatorial and is known to be NP-hard. A common alternative is to convexify it by replacing the $\ell_0$-norm constraint with an $\ell_1$-norm [4, 20] or an $\ell_q$-norm ($0 < q \le 1$) [15, 34].

Many variations of Eq. (1-1) have been proposed that impose extra structure on the dictionary $D$ [11, 18, 33, 38]. The structure imposed on $D$ changes the regularizer term in Eq. (1-1). For example, Elhamifar et al. [11] imposed a block structure on the dictionary, and the regularizer becomes $\sum_i \|c_{[i]}\|_2$, where $c_{[i]}$ is the $i$-th block of the coefficients. Others proposed to combine multiple (similar) input features into a group, i.e., $x$ in Eq. (1-1) becomes a matrix whose columns are the group of vectors [1]. This encourages the vectors in the group to be encoded using the same basis elements (atoms), hence sharing atoms.
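The $\ell_1$-relaxed form of Eq. (1-1), $\min_c \|x - Dc\|_2^2 + \lambda\|c\|_1$, can be minimized with simple first-order methods. The sketch below is an illustration only (it is not the solver used later in this dissertation): it implements the standard iterative shrinkage-thresholding (ISTA) scheme in NumPy, with the step size set from the spectral norm of $D$.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Minimize ||x - Dc||_2^2 + lam * ||c||_1 with iterative shrinkage-thresholding."""
    c = np.zeros(D.shape[1])
    L = 2 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term's gradient
    for _ in range(n_iter):
        grad = 2 * D.T @ (D @ c - x)           # gradient of ||x - Dc||^2
        c = soft_threshold(c - grad / L, lam / L)
    return c

# Toy usage: recover a 3-sparse code from a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
c_true = np.zeros(256)
c_true[[3, 50, 100]] = [1.0, -0.5, 0.8]
x = D @ c_true
c_hat = ista_sparse_code(x, D, lam=0.01)
print("nonzeros:", np.count_nonzero(np.abs(c_hat) > 1e-3))
```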

In the first part of this work, I will propose an affine constrained group sparse coding algorithm. This algorithm is incorporated into the sparse representation-based classification (SRC) framework to perform classification tasks. This novel framework takes into account that, in the age of information, data do not come in singles but in groups. For example, in video surveillance, one second of footage provides about 30 frames. Therefore, there is a need to properly generalize SRC to classification problems that require a decision on a group of test samples. One way to approach this problem is to consider the method proposed in [1]: the test samples are considered as one group and encoded simultaneously. However, in practice, the variability between the training samples and the test samples can be so large that the test samples do not lie in the subspace spanned by the training samples. The proposed algorithm takes advantage of the convex set formed by the test samples. It is designed to capture any vector in this convex set that can best be sparse coded by the dictionary $D$.

Recently, Sprechmann et al. [38] proposed to impose structure on both the input data $x$ and the dictionary $D$. The main focus of their work, however, is on recovering and separating mixtures of signals (using the proximal optimization proposed in [21]). They did not propose or investigate the integration with dictionary learning algorithms.

In the second part of this work, two alternative block and group regularized sparse coding algorithms are proposed. I will prove that the proposed algorithms can indeed produce sparse representations for a group of input data, and I will provide efficient optimization methods for both of them. I will discuss the advantages and disadvantages of incorporating these sparse coding algorithms into dictionary learning algorithms, and a novel dictionary learning algorithm will be presented to alleviate the problems caused by this integration.

CHAPTER 2
AFFINE CONSTRAINED GROUP SPARSE CODING FOR IMAGE CLASSIFICATIONS

2.1 Introduction

Sparse representation-based classification (SRC) has been investigated in several notable recent works (e.g., [10, 46]), and despite the simplicity of the algorithm, the reported recognition results are quite impressive. The geometric motivation behind this approach is the assumption that data from each class reside on a low-dimensional linear subspace spanned by the training images belonging to the given class. The dictionary is obtained virtually without cost by simply stacking the (vectorized) training images into a dictionary matrix $D$ such that training images from a single class form a contiguous block in $D$ (Fig. 2-1). During testing, a test image $x$ is sparse coded with respect to the dictionary by optimizing a (convex) objective function $E(c; x, D)$ of the sparse coefficient $c$ that is usually a sum of an $\ell_2$ data-fidelity term and a sparsity-inducing regularizer:

$$E(c; x, D) = \|x - Dc\|^2 + \lambda \psi(c). \qquad (2\text{-}1)$$

A plethora of sparsity-inducing regularizers have been proposed in the literature (e.g., [19], [1], [40]), and many of them are based on the sparsity-promoting property of the $\ell_1$-norm [5]. The block structure of the dictionary $D$ together with the sparse coding of $x$ allows one to infer the class label of $x$ by examining the corresponding block components of $c$, and SRC essentially looks for the sparsest representation of a test image with the hope that such a representation selects a few columns of $D$ from the correct block (class). However, for many image classification problems in computer vision, the current SRC as embodied by Eq. (2-1) has several inherent shortcomings, and the constrained group sparse coding model proposed and studied in this chapter aims to further improve the effectiveness of SRC by addressing two of them: its generalizability and its extension to multiple inputs.
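To make the block-wise decision rule of SRC concrete, the following sketch (an illustration, not code from this dissertation) assigns a test vector to the class whose dictionary block yields the smallest reconstruction residual; the sparse coefficients can come from any $\ell_1$ solver, such as the ISTA routine sketched in Chapter 1.

```python
import numpy as np

def src_classify(x, D, c, block_index):
    """Assign x to the class whose dictionary block best reconstructs it.

    D           : n x k dictionary whose columns are (vectorized) training images.
    c           : sparse coefficient vector of x with respect to D.
    block_index : length-k integer array; block_index[j] is the class of column j.
    """
    labels = np.unique(block_index)
    residuals = []
    for lbl in labels:
        mask = (block_index == lbl)
        # Reconstruction using only the coefficients of this class's block.
        r = x - D[:, mask] @ c[mask]
        residuals.append(np.linalg.norm(r))
    return labels[int(np.argmin(residuals))]
```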

Figure 2-1. Illustration of the proposed framework. Left: the convex hull formed by the columns of $X$. The image corresponding to each column of $X$ is shown, and $x^*$ ($= Xa$) is the solution determined by the proposed framework, while $\bar{x}$ is the mean of the columns of $X$. The magnitude of each component $a_i$ of $a$ is shown as the white bar in the corresponding image. Right: illustration of $Dc$ with selected atoms from the dictionary $D$ and the sparse coefficients $c$ of $x^*$ shown on the right. Images of the same color are in the same block.

In this age of information, data are plentiful, and in many applications test data do not come in singles but in groups. For example, in video surveillance, a mere one second would provide about 30 frames of data, with perhaps the same number of detected faces to be recognized. Therefore, there is a need to properly generalize SRC for classification problems that require a decision on a group of test data $x_1, \ldots, x_k$. On the other hand, for most computer vision applications, it is a truism that there does not exist a classifier that can correctly anticipate every possible variation of the test images. For image classification, and face recognition in particular, these include variations in illumination, pose, expression and image alignment. In particular, subspace-based classification, of which SRC is a special case, is known to be sensitive to image misalignment [46]. Even a small degree of misalignment can be detrimental and cause temperamental behavior of the classifier with unpredictable outcomes. Compounding the problem further, there is a desire to minimize the size of the dictionary for many reasons, including computational efficiency, and the latter requirement will inevitably limit the dimension of the linear subspace for each class and therefore reduce its generalizability. However, in many vision applications, the difference between the training images and the anticipated test images can be modeled using a small number of (linear)

transformations in the image space. For instance, lighting variation, 2D nonrigid image deformation and, to some extent, pose variation can be approximately modeled using linear transformations in the image space. Therefore, a sensible solution for increasing the generalizability of $D$ would be to obtain these crucial transforms¹ $T_1, \ldots, T_k : \mathbb{R}^d \rightarrow \mathbb{R}^d$ during training, and to apply each $T_i$ to the single test image $x$ to obtain a group of images $x_1, \ldots, x_k$, from which a classification decision will be determined. Somewhat surprisingly, this yields another instance of classification problems with multiple inputs.

¹ We assume that the identity is among the transformations.

Multiple inputs $x_1, \ldots, x_k$ provide a new and different setting for SRC, and a straightforward application of SRC would be to apply SRC individually to each $x_i$ and generate the final classification decision by pooling or voting among the individual classification results. This approach is unsatisfactory because, by treating the $x_i$ independently, it completely ignores the (hidden) commonality shared by these obviously related data. Group sparse coding (GSC) [1] offers a solution that takes into account the potential relation among $x_1, \ldots, x_k$ by sparse coding all data simultaneously using the objective function

$$\min_C E(C; X, D) = \|X - DC\|^2 + \lambda \Psi(C), \qquad (2\text{-}2)$$

where $X$ is the matrix formed by horizontally stacking the $x_i$, and $\Psi(C)$ is an appropriate $\ell_1/\ell_2$-based regularizer. The matrix $C$ of sparse coefficients can be used as in SRC to generate a classification decision by applying, e.g., voting and pooling across its rows. However, there are two undesirable features: the effect of $\Psi$ on the matrix $C$ is difficult to predict and understand, and the pooling and voting still cannot be avoided. For classification problems in computer vision, this chapter argues that a modification of group sparse coding, constrained group sparse coding, using the following objective

function, offers a more principled and flexible approach:

$$E(a, c; X, D) = \|Xa - Dc\|^2 + \lambda \psi(c), \qquad (2\text{-}3)$$

where $a = [a_1, \ldots, a_k]^\top$ is a $k$-dimensional vector with nonnegative components satisfying the affine constraint $a_1 + \cdots + a_k = 1$. Compared with GSC as in Eq. (2-2), the constrained GSC enlarges its feasible domain by including the $k$-dimensional vector $a$; however, the feature vector $c$ used in classification is in fact a vector, not a matrix. Geometrically, constrained GSC is easy to explain, as it simply searches for the vector in the convex hull $S$ generated by $x_1, \ldots, x_k$ that can best be sparse coded using the dictionary $D$. Compared with GSC, the classification decision based on Eq. (2-3) does not require pooling or voting.

The argument in favor of constrained group sparse coding relies mainly on a form of sparse recovery guarantee presented in Theorem 2.1 below. The theorem effectively shows that for any sparse vector $x$ among the columns of $X$, if it can be correctly classified using GSC by minimizing Eq. (2-2), then in the larger domain $S$ it will still be the global minimum of Eq. (2-3) with the same parameter $\lambda$. As will be made more precise later, this allows us to argue that, at least in theory, the classification performance of constrained GSC should be at least as good as that based on GSC or Eq. (2-1).

We conclude the introduction by summarizing the three explicit contributions made in this chapter:

1. We propose a new sparse representation-based classification framework based on constrained group sparse coding, which provides a principled extension of the current SRC framework to classification problems with multiple inputs. The resulting optimization problem can be shown to be convex and can be solved efficiently using an iterative algorithm.

2. We provide some theoretical analysis of the proposed SRC in the form of a sparse recovery result. Based on this result, we argue that, theoretically, the classification

performance of the proposed framework should be equal to or better than that of other existing SRC frameworks such as group sparse coding (GSC).

3. We evaluate the proposed framework using three face recognition-related experiments. The results suggest that the proposed framework does provide noticeable improvements over existing methods.

This chapter is organized as follows. We present the affine constrained group sparse coding (ACGSC) framework in the next section, where the associated optimization problem and its solution are also discussed. We provide a brief survey of the related work in Section 2.3. The experimental results are reported in Section 2.4, and we conclude the chapter with a short summary and a plan for future work.

2.2 Theory and Method

Let $x_1, \ldots, x_k$ denote a group of input test data, and let $D$ be the given dictionary. We further let $X$ denote the matrix $X = [x_1\; x_2\; \ldots\; x_k]$. Our proposed affine constrained group sparse coding seeks to minimize the objective function

$$E_{CGSC}(a, c; X, D) = \|Xa - Dc\|^2 + \lambda \Psi(c) \qquad (2\text{-}4)$$

subject to the nonnegative affine constraint on the group coefficients $a$ ($\sum_{i=1}^{k} a_i = 1$ and $a_1, a_2, \ldots, a_k \ge 0$). Note that in group sparse coding (GSC) [1], there are no group coefficients $a$, and the sparse coefficients are given as a matrix $C$. A schematic illustration of the difference between group sparse coding and our constrained version is shown in Fig. 2-2. The GSC-based classification scheme sparse codes the input feature vectors $x_i$ simultaneously. While some group sparsity can be claimed for this approach based on an appropriate regularizer $\Psi$, it is generally difficult to provide any guarantee on the behavior of the sparse coefficient matrix $C$. On the other hand, for our constrained version, the discrete set of input vectors has been completed to form a convex set $S$, and our approach is designed to capture any vector in this convex set that can best be sparse coded by the dictionary $D$. The situation here shares some similarity with

Figure 2-2. Illustration of the difference between group sparse coding and constrained group sparse coding, and its effect on classification results. The cone represents the subspace spanned by a block of the dictionary $D$. Shaded areas represent the convex hull spanned by the columns of $X$. None of the $x_i$ lie within the subspace; however, some of the points in the convex hull do, and the proposed algorithm is meant to capture these points.

the LP-relaxation of integer programming [44] or the convexification of a non-convex program [35], in which one enlarges the feasible domain in order to achieve convexity and thereby efficiently compute an approximate solution. We remark that the affine constraint is quite necessary in Eq. (2-4); without it, there is always the trivial solution $a = 0$, $c = 0$. It is clear that the optimization problem is indeed convex and completely tractable, as the feasible domain and the objective function are both convex. We iteratively solve for $a$ and $c$ using gradient descent, and this scheme is guaranteed to converge. The only complication is the projection onto the simplex defined by the group coefficient constraint $a_1 + \cdots + a_k = 1$, and this step can be efficiently managed using an iterative scheme described in the supplemental material.
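A minimal sketch of this alternating scheme is given below. It is illustrative only and makes two assumptions not taken from this chapter: the $c$-update uses plain ISTA, and the simplex projection is the standard Euclidean projection onto the probability simplex rather than the scheme described in the supplemental material.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def project_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def acgsc(X, D, lam=0.1, n_outer=50, n_inner=100):
    """Block-coordinate minimization of ||Xa - Dc||^2 + lam*||c||_1
    over c and over a constrained to the probability simplex."""
    _, k = X.shape
    a = np.full(k, 1.0 / k)                      # initialize at the barycenter of the simplex
    c = np.zeros(D.shape[1])
    Lc = 2 * np.linalg.norm(D, 2) ** 2           # Lipschitz constant for the c-subproblem
    La = 2 * np.linalg.norm(X, 2) ** 2           # Lipschitz constant for the a-subproblem
    for _ in range(n_outer):
        x_bar = X @ a
        for _ in range(n_inner):                 # ISTA steps on ||x_bar - Dc||^2 + lam*||c||_1
            c = soft_threshold(c - 2 * D.T @ (D @ c - x_bar) / Lc, lam / Lc)
        grad_a = 2 * X.T @ (X @ a - D @ c)       # gradient of ||Xa - Dc||^2 w.r.t. a
        a = project_simplex(a - grad_a / La)     # projected gradient step stays on the simplex
    return a, c
```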

2.2.1 Theoretical Guarantee

Given a dictionary $D$, a vector $x$ has sparsity $s$ if it can be written exactly as a linear combination of $s$ columns of $D$. An important result that underlies all SRC frameworks is the guarantee provided by sparse recovery theory: for a feature vector $x$ with sparsity bounded by properties of $D$ [7, 11], $x$ can be recovered by minimizing the $\ell_1$ cost function

$$(P_1) \quad \min_c \|c\|_1 \quad \text{subject to } Dc = x. \qquad (2\text{-}5)$$

In actual applications, the above $\ell_1$-program is modified as

$$(P_1^\lambda) \quad \min_c \|x - Dc\|_2^2 + \lambda \|c\|_1, \qquad (2\text{-}6)$$

for a suitably chosen constant $\lambda > 0$. We remark that the two programs, while related, are in fact different, with most sparse recovery results given for minimizing $(P_1)$.

Let $x$ be a noiseless test vector to be classified. A typical SRC method will determine its classification based on the sparse coefficients obtained by minimizing the program $(P_1^\lambda)$. Compared to such methods, our proposed framework enlarges the optimization domain by introducing the group coefficients $a$, and it is possible that with a larger domain, spurious and incorrect solutions could arise. The following theorem rules out this possibility, at least when the sparse vector $x$ can be exactly recovered by a typical SRC method and classified correctly:

Theorem 2.1. Let $x$ be a feature vector with sparsity $s$ such that it can be exactly recovered by minimizing $(P_1^\lambda)$ for some $\lambda$. Furthermore, assume that $x$ is in the convex hull $S$. Then $x$ is the global minimum of $E_{CGSC}$ with the same $\lambda$ and $D$.

Proof. The proof is straightforward and consists of checking that the sparse vector $x$ also corresponds to the global minimum of $E_{CGSC}(a, c)$. Since $x \in S$, we have $x = Xa$ for some feasible $a$. Since $x$ is a sparse vector that can be recovered exactly by minimizing $(P_1^\lambda)$ in Eq. (2-6), we let $c$ be its sparse coefficients, so that $x = Dc$. We claim that $(a, c)$ is a global minimum of $E_{CGSC}$ by showing that the gradient vanishes at $(a, c)$. First, since $c$ is the global minimum of $(P_1^\lambda)$ with $x = Xa$, and $\nabla_c (P_1^\lambda) = \nabla_c E_{CGSC}(a, c)$, the $c$-component of the gradient of $E_{CGSC}$ vanishes:

$$\nabla_c E_{CGSC}(a, c) = 0.$$

On the other hand, by direct calculation, the $a$-component of the gradient of $E_{CGSC}$ is

$$\nabla_a E_{CGSC}(a, c) = X^\top X a - X^\top D c = 0,$$

because $Dc = x = Xa$. This shows that $(a, c)$ is the global minimum of the convex function $E_{CGSC}(a, c)$, regardless of whether $x$ is on the boundary of the convex hull $S$.

From the above, we can draw two important conclusions. First, compared to GSC, our constrained version, with an enlarged feasible domain, will indeed recover the right solution if the (noiseless) solution is indeed among the input feature vectors $x_1, \ldots, x_k$. Therefore, our method will not produce an incorrect result in this case. However, the behavior of GSC in this case is difficult to predict, because other (noisy) input feature vectors will affect the sparse coding of the (noiseless) input vector, and the result of the subsequent pooling based on the matrix $C$ is also difficult to predict. Second, if there is a sparse vector $x$ lying inside the convex hull $S$ spanned by $x_1, \ldots, x_k$, our method will indeed recover it (when the required conditions are satisfied).

2.2.2 Part-Based ACGSC

The ACGSC framework based on Eq. (2-3) is versatile enough to allow for several variations, and here we discuss one, the part-based ACGSC, that is specifically suitable for detecting the presence of occlusions. One variation of Eq. (2-3) is

$$E(\mathcal{A}, c; X, D) = \Big\| \sum_{i=1}^{k} A_i x_i - Dc \Big\|^2 + \lambda \psi(c), \qquad (2\text{-}7)$$

where $\mathcal{A}$ is the set of all $A_i$, the $A_i$ are diagonal matrices with nonnegative elements, $k$ is the number of input samples, and $x_i \in \mathbb{R}^d$ is the $i$-th column of $X$. The affine constraints on the $A_i$ are $\sum_{i=1}^{k} A_i^j = 1$ for $j = 1, \ldots, d$, where $A_i^j$ is the $j$-th diagonal element of $A_i$. The resulting vector $\sum_i A_i x_i$ is the element-wise affine combination² of the $x_i$.

² $\sum_i A_i x_i = \sum_i \mathrm{diag}(A_i) \odot x_i$, where $\odot$ denotes the element-wise product.
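The short sketch below illustrates the element-wise affine combination $\sum_i A_i x_i$ when each $A_i$ has the block-diagonal form of Eq. (2-8) given below, with one scalar weight per part; it is only an illustration, and the part boundaries used here are hypothetical.

```python
import numpy as np

def partwise_combination(X, a_parts, part_slices):
    """Compute sum_i A_i x_i when each A_i is block-diagonal as in Eq. (2-8).

    X           : d x k matrix of test samples (columns).
    a_parts     : n_p x k array; a_parts[p, i] is the scalar weight of sample i on part p,
                  with a_parts[p, :].sum() == 1 for every part p.
    part_slices : list of row-index arrays, one per part (application-defined).
    """
    d, _ = X.shape
    combined = np.zeros(d)
    for p, idx in enumerate(part_slices):
        # Element-wise affine combination restricted to the rows of part p.
        combined[idx] = X[idx, :] @ a_parts[p]
    return combined

# Toy usage with two samples split into two equal parts.
X = np.arange(8.0).reshape(4, 2)
parts = [np.array([0, 1]), np.array([2, 3])]
a = np.array([[0.3, 0.7],    # weights for part 1 sum to 1
              [0.9, 0.1]])   # weights for part 2 sum to 1
print(partwise_combination(X, a, parts))
```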

Figure 2-3. Comparison between the standard (A) and the part-based ACGSC (B). A: $a_i$ are the group coefficients corresponding to the samples $x_i$; the nonnegative affine constraint here is $a_1 + a_2 = 1$. B: the same input samples are split into 4 parts in the part-based approach, giving 4 nonnegative affine constraints, i.e., $a_1^p + a_2^p = 1$ for $p = 1, \ldots, 4$.

Although Eq. (2-7) provides an extension of Eq. (2-3), it is severely underconstrained, as there are $d \cdot k$ unknowns in all the $A_i$. To alleviate this problem, we can further reduce the number of variables in each $A_i$. For example, the parameterization below gives only $n_p$ distinct variables per sample, where the $a_i^j$ are scalar variables and the $I_p$ are identity matrices of appropriate sizes:

$$A_i = \begin{bmatrix} a_i^1 I_1 & & & \\ & a_i^2 I_2 & & \\ & & \ddots & \\ & & & a_i^{n_p} I_{n_p} \end{bmatrix}. \qquad (2\text{-}8)$$

This formulation of $A_i$ is equivalent to splitting a sample $x_i$ into $n_p$ parts. Each part of $x_i$ corresponds to a scalar variable $a_i^p$, and the size of a part is equal to the size of the corresponding $I_p$. Note that the $I_p$ do not necessarily all have the same size. Let $\mathcal{I}_p$ denote the set of indices of the rows in $A_i$ corresponding to $a^p$. Eq. (2-7) can then be rewritten as:

$$\left\| \begin{bmatrix} X^{(\mathcal{I}_1)} & & \\ & \ddots & \\ & & X^{(\mathcal{I}_{n_p})} \end{bmatrix} \begin{bmatrix} a^1 \\ a^2 \\ \vdots \\ a^{n_p} \end{bmatrix} - Dc \right\|^2 + \lambda \psi(c) \quad \text{s.t. } \sum_{i=1}^{k} a_i^p = 1 \text{ for } p = 1, \ldots, n_p \text{ and } a_i^p \ge 0, \qquad (2\text{-}9)$$

where $a^p = [a_1^p, a_2^p, \ldots, a_k^p]^\top$ and $X^{(\mathcal{I}_p)}$ are the rows of $X$ that correspond to the $p$-th part. A comparison between the standard and the part-based ACGSC is illustrated in Fig. 2-3. Because the first part of the data-fidelity term is still a $d$-dimensional vector, the optimization of $c$ is the same as in Eq. (2-3). Although Eq. (2-9) and Eq. (2-3) have a similar structure, the former has a part structure defined on the components of $a$, and in practice the parts are specified by each individual application. For face recognition, we can define the parts according to the image regions where useful features such as the eyes, nose and mouth are to be found.

Since the first matrix in Eq. (2-9) is block-diagonal, Eq. (2-9) can be rewritten as

$$\sum_{p=1}^{n_p} \|X^{(\mathcal{I}_p)} a^p - D^{(\mathcal{I}_p)} c\|^2 + \lambda \psi(c), \qquad (2\text{-}10)$$

where $D^{(\mathcal{I}_p)}$ are the rows of $D$ corresponding to the rows of $X^{(\mathcal{I}_p)}$. Each vector $a^p$ can then be optimized individually under the nonnegative affine constraints given in Eq. (2-9). Note that the indices corresponding to $a^p$, as shown in Eq. (2-9), do not have to be contiguous. This provides us with more flexibility in determining and specifying useful parts, depending on the intended application.

2.3 Related Works

To the best of our knowledge, a framework and algorithm similar to the one proposed in this chapter have not been reported in the computer vision literature. We will keep our presentation succinct and to the point. In particular, we will focus primarily on the

presentation of the algorithm as well as the experimental results using real image data. Due to limited space, we will only summarize the major differences between our work and some of the representative works in dictionary learning and sparse representation that have appeared in the past few years.

Sparse representations have been successfully applied to many vision tasks such as face recognition [24], [31], [42], image classification [43], [10], [14], [27], [28], [32], [48], denoising and inpainting [25], [2], and many other areas [45], [51]. The success is likely due to the fact that sparse representations of visual signals are produced in the first stage of visual processing in the human brain [29]. In many cases, simply replacing the original features with their sparse representations leads to surprisingly better results [45]. Moreover, many of these applications require only unlabeled data during the training phase (dictionary learning). While many of the works focus on replacing the original dense features with sparse codes, some propose adding structured constraints either on the sparse representations [43], [1] or on the dictionary [19], [37]. In the work of Wang et al. [43], a feature is coded with atoms in the dictionary that are local to the feature. Although there is no sparsity-promoting regularizer term ($\ell_1$-norm on the coefficients) in their formulation, their locality constraints promote coding of a feature by its local neighbors, which in turn promotes sparsity. This results in state-of-the-art performance on image classification tasks. In the work of Bengio et al. [1], a sparsity-promoting matrix norm is imposed on the coefficients of data that belong to the same group. This matrix norm encourages features in the same group to share the same atoms (code words) in the dictionary. During the sparse coding phase, this process helps to identify the atoms that are commonly used by images within the same group. In a different context, we assume the sampled data are under certain perturbations, e.g., lighting variations or occlusions. We treat these perturbed samples as a group and apply a constrained sparse coding scheme that helps to identify the underlying features among them.

2.4 Experiments

2.4.1 Face Classification

SRC-based face classification has been extensively studied, and state-of-the-art results have been established in the past [10, 46]. However, none of these works investigated the more realistic scenario in which there can exist large variability between the dictionary (training samples) and the test samples. In this experiment, we used the cropped Extended Yale Face Database B [21] to simulate such a scenario. This database contains aligned images of 38 persons under different laboratory-controlled illumination conditions. For each subject, we chose the images with the azimuth and elevation angles of the light source within 35° as the training samples. The rest of the images in the database, which contain a large amount of shadows, were used as test samples.

Figure 2-4. Selected training (top row) and test (bottom row) samples. Numbers in the parentheses are the azimuth and elevation angles.

The training samples were used to simulate well-lit images such as passport or I.D. photos. The test samples were used to simulate poorly acquired images (e.g., from a surveillance camera) that are very different from the training data. Fig. 2-4 demonstrates the large variability between the chosen training (top row) and test (bottom row) samples.

We used the training samples directly as atoms of the dictionary $D$, as in [10, 46]. Therefore, there are 38 blocks in $D$, and each block contains 24 or 23 atoms³. The number of test samples for each person is around 40. The experiment was conducted as follows:

1. Reduce the dimensionality of the data to 600 using PCA. Normalize the samples to have unit $\ell_2$ norm.
2. Use the training samples directly as atoms of the dictionary $D$.
3. For each subject, randomly select $n_g$ test samples ($X$), with $n_g$ ranging from 2 to 7.
4. Initialize $a = [\frac{1}{n_g}, \ldots, \frac{1}{n_g}]^\top$ and $c = 0$.
5. Iteratively update $a$ and $c$.
6. Determine the class label by
$$\text{label}(X) = \min_i \|Xa - D_i c_i\|_2, \qquad (2\text{-}11)$$
where $D_i$ and $c_i$ are the $i$-th block of $D$ and $c$, respectively.
7. Repeat until all test samples have been used for evaluation.

³ Missing samples in some categories.

We ran the above experiment 10 times, and the results are reported in Fig. 2-5. We compared our result with the result of simply using the mean of the columns of $X$ as the input vector (last column of Fig. 2-6). We also compared with two group regularized sparse coding algorithms, proposed by Bengio et al. [1] and by Sprechmann et al. [38]. In [1], the energy function they minimize is of the same form as Eq. (2-2):

$$\|X - DC\|_F^2 + \lambda \sum_i \|C^i\|_2, \qquad (2\text{-}12)$$

where $C^i$ is the $i$-th row of $C$. Their algorithm promotes the data in a group (the columns of $X$) to share the same dictionary atoms for encoding.

In [38], on top of the group structure, they added a block structure on $D$. The energy function they minimize is also of the same form as Eq. (2-2):

$$\|X - DC\|_F^2 + \lambda \sum_{b=1}^{n_b} \|C_b\|_2, \qquad (2\text{-}13)$$

where $C_b$ denotes the rows of $C$ that correspond to the $b$-th block of $D$. Similar to the previous algorithm, Sprechmann's algorithm promotes the data in $X$ to share the same blocks of $D$ for encoding. We also compared with the results using the algorithms from the works of Wright et al. [46] and of Elhamifar et al. [10]. We applied these two methods to every test sample individually, since they do not utilize a group structure on the data, and their results are also reported in Fig. 2-5⁴. The results show that our proposed method significantly outperformed the other algorithms. The reason is that, as shown in Fig. 2-2, our proposed method provides a larger feasible domain (the simplex spanned by $X$), while the group regularized methods can only rely on a few atoms. The regularizer parameters ($\lambda$) for each method are listed in Fig. 2-5.

Fig. 2-6 shows four groups of test samples, the actual computed coefficients $a$ (white bars), the image at optimality, $Xa$ (second column from the right), and the mean image (last column). The images at optimality have lighting conditions that are more similar to the atoms in $D$ (top row in Fig. 2-4) than the mean images if the simplex spanned by the columns of $X$ lies within the subspaces spanned by the blocks of $D$. The first three rows demonstrate successful examples of our method. We were not able to classify the 1st and 3rd rows correctly by using the mean of $X$ because the majority of the images in these groups are too dark. The bottom row shows a failed case where none of the samples contains many distinguishable features.

⁴ The results in Fig. 2-5 for these two methods are worse than what was reported in the original literature ([46] and [10]). This is because, in their experiments, they randomly chose half of the dataset as atoms of $D$ and the other half as test samples. Therefore, their dictionaries are 50% larger than ours, and the variability between training and test samples is minimized due to the random selection.
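For reference, the two group regularizers in Eqs. (2-12) and (2-13) are straightforward to evaluate. The sketch below is illustrative only; it assumes a block-label vector for the dictionary atoms and takes the norm of a coefficient sub-matrix to be its Frobenius norm, which is one common convention.

```python
import numpy as np

def row_group_penalty(C):
    """Eq. (2-12) regularizer: sum of the l2 norms of the rows of C (Bengio et al.)."""
    return float(np.sum(np.linalg.norm(C, axis=1)))

def block_group_penalty(C, block_index):
    """Eq. (2-13) regularizer: sum over dictionary blocks of the norm of the
    corresponding rows of C (Sprechmann et al.); block_index[j] is the block of atom j."""
    return float(sum(np.linalg.norm(C[block_index == b, :])
                     for b in np.unique(block_index)))
```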

Figure 2-5. Comparison of our method with different sparse coding algorithms (classification rate (%) versus $n_g$). The $\lambda$ for each method, in the order of the legend (Ours, Sprechmann, Mean of X, Bengio, Wright, Elhamifar), are 0.05, 0.2, 0.05, 0.1, 0.05, and 0.05, respectively.

The results also show that samples which are more similar to atoms in the dictionary usually have higher $a$ values. In fact, if a sample does not provide any useful features that can trigger a match from the dictionary, its corresponding $a$ value will often be driven to zero (e.g., the 2nd and 3rd images in the 1st row), even though we do not put any sparsity-inducing constraint on $a$.

2.4.2 Imposter Detection

From the results in the previous section, we observed that samples in $X$ have higher corresponding values in $a$ if they are more similar to the atoms in the dictionary. Now, let us assume we acquired a collection of samples that contains $n_k$ class-$k$ samples and $n_j$ samples that are not in class $k$. We call these $n_j$ samples the imposters. One straightforward approach is to apply our proposed algorithm to these $n_k + n_j$ samples; the samples with lower corresponding $a_i$ values would then be the imposters. This approach works well if the $n_j$ samples are not close to other blocks in the dictionary, e.g., the $n_k$ samples are face images of a person and the $n_j$ samples are images of other objects. However, when the $n_j$ samples are also very similar to some dictionary atoms, this method becomes unreliable. We therefore modify our algorithm slightly to suit

Figure 2-6. Left: columns of $X$ and the values of the computed $a$ (white bars). The last two columns are the results of $Xa$ and the mean of the columns of $X$, respectively. The first three rows show successful examples.

this purpose. When updating the affine coefficient $a$, instead of using the whole dictionary, we compute $a$ using

$$a = (X^\top X)^{-1} X^\top D_i c_i, \qquad (2\text{-}14)$$

where $D_i$ and $c_i$ are the $i$-th block of $D$ and $c$, respectively, and the $i$-th block is the block that has the minimum reconstruction error with respect to $Xa$. The value of $i$ can be computed using Eq. (2-17). We compared with the results using the algorithm proposed by Sprechmann et al. [38], which minimizes Eq. (2-13). As mentioned in the previous section, their algorithm promotes $X$ to use the same few blocks of $D$ for encoding. In other words, the imposters in $X$ are forced to use the same blocks as the other inliers for encoding. This will likely

result in larger reconstruction errors for the imposters, as they use blocks that are irrelevant to them (but relevant to the inliers) for encoding. In this baseline, the reconstruction errors are used to determine the imposters.

To make the task more challenging, instead of following the set-up of the previous section, we used the AR-Face dataset [30]. This dataset contains face images of 100 individuals. There are 14 images of each individual with different expressions and illumination variations (in terms of lighting intensity, not lighting direction as in the Extended Yale Database B). For each person, we randomly chose $n_g$ images as test samples and the rest as training samples. The experiment was conducted as follows:

1. Project the data samples down to 600 dimensions using the principal components of the training samples.
2. Use the training samples as atoms of $D$ directly. $D$ has a 100-block structure.
3. For the $p$-th person, pick $n_g$ images from its test samples and randomly pick $n_i$ imposter images from the test samples of other people. These $n_g + n_i$ images are the columns of $X$.
4. Initialize $a_i = 1/(n_g + n_i)$, $c = 0$ and $C = 0$.
5. Iteratively update $c$ (the sparse coefficients of $Xa$) and $a$, using Eq. (2-14) to update $a$.
6. Determine the imposters by comparing the values in $a$ with a threshold value $\epsilon_1$: $X_i$ is an imposter if $a_i < \epsilon_1$.
7. Compute the sparse coefficients $C$ of $X$ using the algorithm in [38].
8. Compute the reconstruction error $e$ of each column of $X$: $e_j = \|X^j - D_i C_i^j\|_2$, where $X^j$ is the $j$-th column of $X$ and $C_i^j$ is the $i$-th block of the $j$-th column of $C$. Here $i$ is the index of the active block for the data $X$, determined by
$$\text{active-block}(X) = \min_i \|X - D_i C_i\|_F, \qquad (2\text{-}15)$$
where $C_i$ are the rows of $C$ that correspond to the $i$-th block of $D$.

Figure 2-7. Precision vs. recall curves and average precision (A.P.) comparisons of our method and Sprechmann's for various $n_i$ ($n_i = 3, 4, 5, 7, 10$, and $4^*$). In the bottom-right graph ($n_i = 4^*$), all the $n_i$ imposters are from one single class.

9. $X^j$ is an imposter if $e_j > \epsilon_2$.

We repeated the above procedure 10 times for each class. The $\lambda$ for our algorithm and for the algorithm in [38] were set to 0.1 and 0.5, respectively; $n_g$ is 5, and $n_i$ ranges from 3 to 10. The precision vs. recall curves and the average precisions (areas under the curves) of both methods are shown in Fig. 2-7. The results show that our detection performance is slightly better than that of Sprechmann's. The average precision for $n_i = 3$ (80%) does not differ much from that for $n_i = 10$ (74%). This is because, when we chose the imposters in step 3, the imposters were chosen from across all other classes; therefore, the inliers remain the majority class in $X$. This is how a standard data acquisition system should normally work. We conducted another experiment with $n_g = 5$ and $n_i = 4$, but with all four imposters chosen from one single class. The result is shown in the bottom-right graph in Fig. 2-7. The average precisions of both methods are significantly worse.
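The modified $a$-update of Eq. (2-14) is an unconstrained least-squares fit of $Xa$ to the active block's reconstruction $D_i c_i$. A minimal sketch is given below (illustrative only; it solves the least-squares problem with lstsq rather than forming $(X^\top X)^{-1}$ explicitly), together with the thresholding rule of step 6.

```python
import numpy as np

def update_a_against_block(X, D_i, c_i):
    """Eq. (2-14): least-squares group coefficients a minimizing ||Xa - D_i c_i||_2,
    where D_i and c_i are the active dictionary block and its coefficients."""
    target = D_i @ c_i
    a, *_ = np.linalg.lstsq(X, target, rcond=None)
    return a

def flag_imposters(a, eps=0.1):
    """Step 6: samples whose group coefficient falls below the threshold are flagged."""
    return np.nonzero(a < eps)[0]
```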

Figure 2-8. Imposter detection results. First three rows: detection results using our affine model. Last row: detection results using reconstruction errors. The imposters are listed in the last two columns. The third row is a failed case where the 2nd sample from the left was identified as an imposter.

We show some selected results with their $a$ (first three rows) and $e$ (last row) values in Fig. 2-8. For presentation purposes, we only show two imposter data samples. The results show that imposters usually have lower $a$ or higher $e$ values. The third row demonstrates a failed case where the facial expressions of both imposters are very similar to one of the inliers (4th from the left). This imposter detection framework could easily be extended to other applications such as unusual event detection [50]. It is, however, beyond the scope of this chapter, and we leave it for future work.

2.4.3 Face Recognition with Occlusions

We tested our proposed approach using the AR-Face database [30]. This dataset contains face images of 100 individuals. There are 14 non-occluded and 12 occluded

images of each individual with different expressions and illumination variations. The occluded images contain two types of occlusion: sun-glasses, and a scarf covering the face from the nose down. Each occlusion type contains 6 images per person. To reduce the dimensionality, we down-sampled the images and vectorized them. We randomly selected 8 non-occluded images from each person to form the dictionary $D$ with a 100-block structure. The experiment was performed as follows:

1. Randomly select $n_g$ test samples ($X$) from the occluded images of person $p$. They must contain at least one image from each type of occlusion.
2. Split the test images into 6 uniformly-sized and non-overlapping parts (Fig. 2-10D).
3. Initialize $A_i = I/n_g$ ($a^p = 1/n_g$) and $c = 0$.
4. Iteratively optimize $c$ and the $A_i$ (the $a^p$'s) using Eq. (2-9) and Eq. (2-10), respectively.
5. Determine the class label by
$$\text{label}(X) = \min_i \sum_{p=1}^{n_p} \|X^{(\mathcal{I}_p)} a^p - D_i^{(\mathcal{I}_p)} c_i\|_2, \qquad (2\text{-}16)$$
where $D_i$ and $c_i$ are the $i$-th block of $D$ and $c$, respectively. This equation is a suitable modification of Eq. (2-17).

We repeated the above procedure 20 times for each person. We compared the results with those of the standard ACGSC. We also compared our results with those of GSC [1] and of Sprechmann [38]; both algorithms treat the multiple test samples as a group and sparse code them simultaneously. Lastly, we compared with the results of directly applying regular sparse coding (Wright et al. [46]) and block sparse coding (Elhamifar et al. [10]) to the average test sample (Fig. 2-10B).

The results in Fig. 2-9 show that our part-wise ACGSC outperforms the other methods by a significant margin. The standard ACGSC does not have a significant advantage over the other methods; this is due to the fact that the occlusions are present in all the test samples. Fig. 2-10D shows the part-based group coefficients ($a^p$) of the test samples after the optimization. The values are displayed using the colors overlaid on the corresponding parts. The largest coefficient in this specific example is around 0.6. We can clearly see

Figure 2-9. Comparison of classification results (classification rate (%) versus $n_g$). The $\lambda$ values used in the methods, in the order of the legend (Ours w/ parts, Ours w/o parts, Elhamifar, Wright, GSC, Sprechmann), are 0.005, 0.005, 0.004, 0.002, 0.01, and 0.02, respectively.

that the parts corresponding to the occluded regions have significantly low values, i.e., our method correctly identified the occlusions. Fig. 2-10E shows the part-based affine combination of the test images. Our part-based approach was able to select the parts that are more consistent and aligned with the dictionary (training samples). Fig. 2-10F shows the reconstructed image using our method, and Fig. 2-10C shows the image reconstructed by directly applying Wright's method [46] to the average image (Fig. 2-10B). Due to the occlusion of the scarf, the test samples were incorrectly matched to training samples from a male subject with a beard and mustache.

2.4.4 Texture Classification

In this experiment, we used the cropped Columbia-Utrecht (Curet) texture database [41]⁵. This database contains images of 61 materials which broadly span the range of different surfaces that we commonly see in our environment (see Fig. 2-11). It has a total of 61 × 92 = 5612 images, each of size 200-by-200 pixels. For each texture, we randomly chose 20 images as training samples and the rest as test samples. The experiment was conducted as follows:

1. For each image, compute its SIFT⁶ feature over the entire image. Each image is simply represented by one 128-dimensional SIFT vector.

⁵ Images can be downloaded at vgg/research/texclass/setup.html

⁶ We used the vl_sift package.

Figure 2-10. A: input test samples. B: the average of the 3 test samples. C: reconstructed image using Wright's method on the average image B. D: weights of the part-wise group coefficients overlaid on the corresponding test samples; the redder the shade, the higher the affine weight. E: the part-wise affine combination of the test samples after the optimization. F: reconstructed image using our proposed approach.

Figure 2-11. Selected images from the cropped Curet database. Top row: 5 different types of textures. Bottom row: the same texture under different illumination conditions.

2. Normalize the SIFT vectors to have unit $\ell_2$ norm. Truncate large elements in these vectors by clipping values greater than 0.25, then normalize them again.
3. Use the training samples directly as atoms of the dictionary $D$. $D$ is of size $128 \times 1{,}220$ and has a 61-block structure.
4. For each class, randomly choose $n_g$ test samples ($X$).
5. Iteratively update $c$ and $a$ in Eq. (2-4), with a fixed regularizer $\lambda$ for computing $c$.
6. Repeat the above two steps until all test samples in a given class have been used for testing.
7. Determine the class label of $X$ by $\text{label}(X) = \min_i \|Xa - D_i c_i\|_2$, where $D_i$ and $c_i$ are the $i$-th block of $D$ and $c$, respectively.

We compared our results with those obtained using the framework of Wright et al. [46]. Since there is no group structure defined in their framework, we computed the sparse coefficients of all the test samples individually. The class label is determined using the following equation: $\text{label}(x) = \min_i \|x - D_i c_i\|_2$. The classification results are listed in Fig. 2-12. The results are surprisingly good for such simple features (a single SIFT descriptor over the whole texture image). The result using Wright's framework is comparable to the state-of-the-art results [41] for this dataset. Our framework further improved the result of Wright et al. by 3.3% (the classification rate is 99.67% when $n_g = 5$).
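The feature preprocessing in step 2 (unit-normalize, clip, renormalize) amounts to the following short routine; it is only an illustration of the step described above and assumes the SIFT descriptors have already been extracted.

```python
import numpy as np

def preprocess_sift(F, clip=0.25):
    """Normalize each SIFT descriptor (rows of F) to unit l2 norm,
    clip large entries at `clip`, and renormalize."""
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    F = np.minimum(F, clip)     # SIFT entries are nonnegative, so a one-sided clip suffices
    return F / np.linalg.norm(F, axis=1, keepdims=True)
```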

Figure 2-12. Classification rates of the texture classification experiment (ours vs. Wright, as a function of $n_g$).

2.5 Future Work

For future work, we will investigate more theoretical aspects of the approach. We believe that it is possible to obtain a stronger form of the sparse recovery result under a noisy assumption, providing a better understanding of the power and limitations of the proposed algorithm. From the practical side, we will also investigate efficient methods and strategies for determining the collection of relevant transforms that can be applied online to improve the robustness and accuracy of the SRC-based approach described in the introduction. Furthermore, we will investigate useful and effective priors for the group coefficients $a$ and the resulting (usually nonconvex) optimization problem. From the first experiment, we have observed that the group coefficients $a$ tend to have only a few large elements while the rest are very close to zero. It therefore makes sense to use a sparsity-inducing prior that does not violate the original constraints on $a$. One immediate candidate would be the Dirichlet prior on the simplex spanned by $a$, with the modified objective function

$$E_{CGSC}(a, c; X, D) = \|Xa - Dc\|^2 + \lambda \psi(c) + \mu \, \mathrm{DIR}(a),$$

where $\mathrm{DIR}$ is the density function of the Dirichlet distribution on the unit simplex in $\mathbb{R}^k$.

CHAPTER 3
BLOCK AND GROUP REGULARIZED SPARSE MODELING FOR DICTIONARY LEARNING

3.1 Introduction

Sparse modeling and dictionary learning have emerged recently as an effective and popular paradigm for solving many important learning problems in computer vision. Their appeal stems from an underlying simplicity: given a collection of data $X = \{x_1, \ldots, x_l\} \subset \mathbb{R}^n$, learning can be formulated using an objective function of the form

$$Q(D, C; X) = \sum_g \|X^{(g)} - DC^{(g)}\|_F^2 + \lambda_D \Psi(D) + \lambda_C \sum_g \Omega(C^{(g)}), \qquad (3\text{-}1)$$

where the $X^{(g)}$ are vectors/matrices generated from the data $X$, and $\Psi$, $\Omega$ are regularizers on the learned dictionary $D$ and the sparse coefficients $C^{(g)}$, respectively. In dictionary learning, $\Omega(C)$ is usually based on various sparsity-promoting norms that depend on the extra structures placed on $D$, and it is the regularizer $\Psi(D)$ that largely determines the nature of the dictionary $D$. It is surprising that such an innocuous formula template has generated an active and fertile research field, a testament to the power and versatility of the notions that underlie the equation: linearity and sparsity.

If Eq. (3-1) provides the elegant theme, its variations are often composed of extra structures placed on the dictionary $D$ [11, 18, 33, 38], and, less frequently, of different ways of generating sparsely-coded data $X^{(g)}$ for training the dictionary. The former affects how the two regularizers $\Psi$, $\Omega$ should be defined, and the latter determines how the vectors/matrices $X^{(g)}$ should be generated from $X$. For classification, a block structure is often imposed on $D$, and hierarchical structures can be further specified using these blocks [18, 39], with the aim of endowing the learned dictionary with certain predictive power. To promote sparsity, the block structure on $D$ is often accompanied by an appropriate block-based $\ell_2$-norm (e.g., the $\ell_1/\ell_2$-norm [49]) used in $\Omega(C)$. On the other hand, for $X^{(g)}$, a common approach is to generate a collection of groups of data vectors $\{x_{g_1}, \ldots, x_{g_k}\}$

Figure 3-1. Illustration of the proposed Block/Group Sparse Coding algorithm. A group of data $X^{(g)}$ on the left is sparsely coded with respect to the dictionary $D$ with block structure $D_{[1]}, \ldots, D_{[b]}$.

from $X$ and to simultaneously sparse code the data vectors in each data group $X^{(g)}$ [1]. For classification problems, the idea is to generate data groups $X^{(g)}$ with feature vectors $x_{g_i}$ that should be similarly encoded, and such data groups $X^{(g)}$ can be obtained using problem-specific information such as class labels, similarity values and other information sources (e.g., neighboring image patches).

In a noiseless setting, our proposed problem of encoding sparse representations for a group of signals $X$ using the minimum number of blocks from $D$ can be cast as the optimization program

$$P_{\ell_0,p}: \quad \min_C \sum_i I(\|C_{[i]}\|_p) \quad \text{s.t. } X = DC, \qquad (3\text{-}2)$$

where $I(\cdot)$ is an indicator function, $p = 1, 2$, and $C_{[i]}$ is the $i$-th block (sub-matrix) of $C$ that corresponds to the $i$-th block of $D$, as shown in Fig. 3-1. This combinatorial problem is known to be NP-hard, and the $\ell_1$-relaxed version of the above program is

$$P_{\ell_1,p}: \quad \min_C \sum_i \|C_{[i]}\|_p \quad \text{s.t. } X = DC. \qquad (3\text{-}3)$$

We will call this program Block/Group Sparse Coding (BGSC), as it incorporates both the group structure in the data and the block structure in the dictionary.
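For intuition, a penalized form of $P_{\ell_1,2}$, $\|X - DC\|_F^2 + \lambda \sum_i \|C_{[i]}\|_F$, can be minimized by proximal-gradient steps in which each coefficient block is shrunk as a unit. The sketch below illustrates that generic scheme; it takes the block norm to be the Frobenius norm of the sub-matrix and is not the optimization algorithm presented later in this chapter.

```python
import numpy as np

def block_shrink(B, t):
    """Proximal operator of t * ||B||_F: shrink the whole block toward zero."""
    nrm = np.linalg.norm(B)
    return np.zeros_like(B) if nrm <= t else (1.0 - t / nrm) * B

def bgsc_penalized(X, D, blocks, lam=0.1, n_iter=300):
    """Proximal gradient on ||X - DC||_F^2 + lam * sum_i ||C_[i]||_F.

    blocks : list of index arrays, one per dictionary block (rows of C)."""
    k, m = D.shape[1], X.shape[1]
    C = np.zeros((k, m))
    L = 2 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the data-term gradient
    for _ in range(n_iter):
        G = 2 * D.T @ (D @ C - X)            # gradient of the data-fidelity term
        C = C - G / L
        for idx in blocks:                   # block-wise shrinkage (the proximal step)
            C[idx, :] = block_shrink(C[idx, :], lam / L)
    return C
```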

In some applications in which the main concern is identifying the contributing blocks rather than finding the sparse representation [11], the following optimization program is considered:

$$P'_{\ell_0,p}: \quad \min_C \sum_i I(\|D_{[i]} C_{[i]}\|_p) \quad \text{s.t. } X = DC. \qquad (3\text{-}4)$$

Again, this program is NP-hard, and its $\ell_1$ relaxation is

$$P'_{\ell_1,p}: \quad \min_C \sum_i \|D_{[i]} C_{[i]}\|_p \quad \text{s.t. } X = DC. \qquad (3\text{-}5)$$

We will call the programs $P'_{\ell_0,p}$ and $P'_{\ell_1,p}$ Reconstructed Block/Group Sparse Coding (R-BGSC), as they focus on minimizing the norm of the reconstruction term $D_{[i]} C_{[i]}$. The optimization algorithms for solving $P_{\ell_1,p}$ and $P'_{\ell_1,p}$ will be presented in Sec. 3.2.

Due to limited space, we will only summarize the major differences between our work and some of the representative approaches in dictionary learning and sparse representation that have appeared in recent years. Group sparse coding was first introduced in [1]. Elhamifar and Vidal [11] explicitly imposed a block structure on the dictionary for classification. However, none of these approaches investigated either the combined framework that incorporates both the group and block structures or the effect of combining these two structures on dictionary learning. Sparse Representation-based Classification (SRC) was studied recently in [10, 46]; however, the dictionary is simply the columns of the training data, and there is no emphasis on minimizing the intra-block coherence. Although the work in [33] shares some superficial similarities with our dictionary learning algorithm, the differences are major. First, Ramirez et al. train each block using a different collection of data, and therefore there is no notion of training the blocks of $D$ simultaneously as in our framework. Because of this, the main objective in [33] is the minimization of the inter-block coherence instead of the intra-block coherence. Finally, Sprechmann et al. [38] proposed a sparse coding scheme that is similar to our BGSC (using the proximal optimization proposed in

[47]). However, they neither proposed nor investigated the integration with dictionary learning algorithms, as their work is focused on signal recovery and separation.

Sharing of dictionary atoms among data in the same group has been shown to increase the discriminative power of the dictionary [1]. With the block structure added to the dictionary $D$, our proposed BGSC and R-BGSC algorithms promote a group of data to share only a few blocks of $D$ for encoding. Therefore, incorporating these sparse coding algorithms into a dictionary learning framework, which iteratively updates the coefficients of the data and the atoms of $D$, will result in training each block of $D$ using only a few groups of data. This means that a badly written digit 9, which looks like a 7, when grouped together with other normally written 9's, will be encoded using the atoms those 9's use. The badly written 9 will, in turn, be used to train the atoms in $D$ that represent 9's rather than those that represent 7's.

Another novelty of our framework is that we do not assign a class of signals to specific blocks of the dictionary, unlike other sparse representation-based classification (SRC) approaches [10, 46] and [33]. This allows some blocks to store features shared between different classes. Ramirez et al. [33] trained a single dictionary block for each group of data. This method increases the redundancy of the information encoded in the learned dictionary, as the information common to two or more groups (a common scenario in many classification problems) needs to be stored separately within each block. Since one dictionary block is assigned to each class, the redundancy induced in the dictionary needs to be reduced for greater efficiency. This is performed by removing dictionary elements whose mutual dot product has an absolute value greater than an arbitrary threshold (e.g., 0.95). Instead, we provide an objective function whose minimization naturally produces dictionaries that are less redundant. In fact, our proposal to encode data from a single class using multiple blocks obviates the need to even incorporate an explicit inter-block coherence minimization term, unlike [33]. By not assigning classes to blocks and generally having more blocks in $D$ than the number of classes, we allow shared blocks that contain

Moreover, data with more variability are allowed to use more blocks.

As proved in [9], the program P_{\ell_1,p} (Eq. 3-3) is equivalent to P_{\ell_0,p} (Eq. 3-2) when

n_a (2k - 1)\, \mu_B < 1 - (n_a - 1)\, \mu_S,   (3-6)

where n_a and k are the size and the rank of a block, respectively, and \mu_B and \mu_S are the inter- and intra-block coherence defined in Section 3.2.4, respectively. (The proof in [9] covers the case where p = 2 and X is a single vector; in Section 3.2.1 we show that the condition still holds when X is a matrix.) In other words, the smaller \mu_S is, the more likely the two programs are equivalent. One way to achieve the minimum \mu_S is to make the atoms orthonormal within each block [6, 23]. However, such dictionaries (over-complete dictionaries formed by a union of orthonormal bases) do not perform as well as those with a more flexible structure [36]. For example, in SRC-based face recognition each block contains atoms describing faces of the same person, and it does not make sense to impose strict orthogonality within each block. Therefore, rather than imposing a strong orthogonality constraint on each block, we propose a dictionary learning algorithm that only suppresses the intra-block coherence. We elaborate on the effect of our framework on intra-block coherence in Sec. 3.3.1.

The proposed dictionary learning framework learns the dictionary D by minimizing the objective function given in Eq. (3-20), whose third term measures the mutual coherence within each block of D. The corresponding sparse coding can be either BGSC or R-BGSC.
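As a concrete illustration of the condition in Eq. (3-6), the following sketch (our own illustrative NumPy code, not part of the original framework) computes the intra- and inter-block coherences according to the definitions given later in Sec. 3.2.4 and checks whether the sufficient condition holds for a given block rank k.

```python
import numpy as np

def intra_block_coherence(D, blocks):
    """mu_S: largest normalized |<d_p, d_q>| over distinct atoms within a block."""
    mu_s = 0.0
    for idx in blocks:
        B = D[:, idx] / np.linalg.norm(D[:, idx], axis=0)  # normalize atoms
        G = np.abs(B.T @ B)
        np.fill_diagonal(G, 0.0)
        mu_s = max(mu_s, G.max())
    return mu_s

def inter_block_coherence(D, blocks):
    """mu_B: max over block pairs of (1/n_a) * largest singular value of D_[i]^T D_[j]."""
    n_a = len(blocks[0])
    mu_b = 0.0
    for i in range(len(blocks)):
        for j in range(len(blocks)):
            if i == j:
                continue
            s1 = np.linalg.norm(D[:, blocks[i]].T @ D[:, blocks[j]], 2)  # largest singular value
            mu_b = max(mu_b, s1 / n_a)
    return mu_b

def equivalence_condition_holds(D, blocks, k):
    """Check n_a (2k - 1) mu_B < 1 - (n_a - 1) mu_S  (Eq. 3-6)."""
    n_a = len(blocks[0])
    mu_s = intra_block_coherence(D, blocks)
    mu_b = inter_block_coherence(D, blocks)
    return n_a * (2 * k - 1) * mu_b < 1 - (n_a - 1) * mu_s
```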

Besides the novel sparse coding algorithms, BGSC and R-BGSC, there are three specific features that distinguish our dictionary learning framework from existing methods:

1. Instead of inter-block coherence, the proposed ICS-DL algorithm presented in Sec. 3.2.4 minimizes the intra-block coherence as one of its main objectives.

2. Our framework does not require assigning a class or a group of data to specific block(s) of the dictionary, as in [33]. This allows some blocks of the dictionary to be shared by different classes.

3. The dictionary is trained simultaneously with respect to each group of training samples X^{(g)} using our proposed block/group regularized SC algorithms.

We evaluate the proposed methods on classification (supervised) and clustering (unsupervised) problems using well-known datasets. Preliminary results are encouraging, demonstrating the viability and validity of the proposed framework.

3.1.1 Related Work

In this section, we only summarize the major differences between our work and some of the representative approaches in dictionary learning and sparse representation that have appeared in recent years. Group sparse coding was first introduced in [1]. Elhamifar and Vidal [11] explicitly imposed a block structure on the dictionary for classification. However, none of these approaches investigated either the combined framework incorporating both the group and block structures or the effect of combining these two structures on dictionary learning. Sparse Representation based Classification (SRC) was studied recently in [10, 46]; however, the dictionary there is simply the columns of the training data, and there is no emphasis on minimizing the intra-block coherence. Although the work in [33] shares some superficial similarities with our dictionary learning algorithm, the differences are major. First, Ramirez et al. train each block using a different collection of data, and therefore there is no notion of training the blocks of D simultaneously as in our framework. Because of this, the main objective in [33] is the minimization of inter-block coherence instead of intra-block coherence. Finally, Sprechmann et al. [38] proposed a sparse coding scheme that is similar to our BGSC (using the proximal optimization proposed in [47]). However, they did not propose or investigate its integration with dictionary learning algorithms, as their work is focused on signal recovery and separation.

3.2 Methods

In this section, we describe the algorithms in our proposed framework. We start with the sparse coding algorithms and work our way towards the full dictionary learning algorithm. We denote scalars with lower-case letters, matrices with upper-case letters, and the i-th block and the i-th group of a matrix (or vector) with Z_{[i]} and Z^{(i)}, respectively.

3.2.1 Theoretical Guarantee

It is important to understand the conditions on D under which our convex relaxations (Eq. (3-3) and (3-5)) are equivalent to their original combinatorial programs (Eq. (3-2) and (3-4)). In other words, we want to examine the conditions under which our proposed programs yield the same exact recoveries as their corresponding combinatorial programs. The conditions for the case where X^{(g)} is a single vector were proved in [11]. Using linear algebra, we can convert our programs, in which X^{(g)} and C^{(g)} are matrices, into equivalent programs in which X^{(g)} and C^{(g)} are vectors. The conversion is straightforward and is listed in Appendix A. We then prove the equivalence conditions of our programs in a similar way as in [11].

3.2.2 Block/Group Sparse Coding

The program P_{\ell_1,p} in Eq. (3-3) can be cast as an optimization problem that minimizes the objective function:

Q_c(C; X, D) = \sum_g Q_c(C^{(g)}; X^{(g)}, D) = \sum_g \Big( \tfrac{1}{2}\|X^{(g)} - D C^{(g)}\|_F^2 + \lambda \sum_i \|C^{(g)}_{[i]}\|_p \Big).   (3-7)

For clarity of presentation, we present the optimization steps only for one specific group of data X and its corresponding sparse coefficients C. Eq. (3-7) can then be written as:

\tfrac{1}{2}\|X - DC\|_F^2 + \lambda \sum_i \|C_{[i]}\|_p = \tfrac{1}{2}\Big\|X - \sum_{i \neq r} D_{[i]} C_{[i]} - D_{[r]} C_{[r]}\Big\|_F^2 + \lambda \|C_{[r]}\|_p + c,   (3-8)

where c includes the terms that do not depend on C_{[r]}.

When p = 1, this objective function is separable, and iterates of the elements of C_{[r]} can be computed using a method similar to [1]. When p = 2 (we use the element-wise \ell_2 norm here, which is the Frobenius norm), the objective is only block-wise separable. Computing the gradient of Eq. (3-8) with respect to C_{[r]}, we obtain the following sub-gradient condition:

-D_{[r]}^\top X + D_{[r]}^\top \sum_{i \neq r} D_{[i]} C_{[i]} + D_{[r]}^\top D_{[r]} C_{[r]} + \lambda\, \partial\|C_{[r]}\|_F \ni 0.   (3-9)

Let us assume for now that the optimal solution for C_{[r]} has a non-zero norm (\|C_{[r]}\|_F > 0), so that the gradient of \|C_{[r]}\|_F is C_{[r]} / \|C_{[r]}\|_F. Moving the first two terms to the right-hand side and denoting N = D_{[r]}^\top (X - \sum_{i \neq r} D_{[i]} C_{[i]}), substituting the positive semi-definite matrix D_{[r]}^\top D_{[r]} with its eigen-decomposition U \Sigma U^\top, and multiplying both sides of the equation by U^\top, we have

U \Sigma U^\top C_{[r]} + \lambda \frac{C_{[r]}}{\|C_{[r]}\|_F} = N \;\Longrightarrow\; \Sigma U^\top C_{[r]} + \lambda \frac{U^\top C_{[r]}}{\|C_{[r]}\|_F} = U^\top N.   (3-10)

Changing variables Y = U^\top C_{[r]} and using the fact that the Frobenius norm is invariant under orthogonal transformations, we have

\Sigma Y + \lambda \frac{Y}{\|Y\|_F} = \hat{N},   (3-11)

where \hat{N} = U^\top N. Setting \kappa = \|Y\|_F and \hat{Y} = Y / \|Y\|_F, we have

\hat{Y} = (\kappa \Sigma + \lambda I)^{-1} \hat{N}, \quad \text{s.t.}\ \|\hat{Y}\|_F = 1.   (3-12)

Since \Sigma is a diagonal matrix, (\kappa \Sigma + \lambda I)^{-1} is also a diagonal matrix with diagonal entries 1/(\kappa \sigma_i + \lambda), where \sigma_i is the i-th eigenvalue in \Sigma.

Therefore, the constraint \|\hat{Y}\|_F = 1 implies that

\sum_{i,j} \frac{\hat{N}_{i,j}^2}{(\kappa \sigma_i + \lambda)^2} = 1,   (3-13)

where \hat{N}_{i,j} is the (i, j)-th element of the matrix \hat{N}. We solve for the root of this one-variable equation in \kappa using standard numerical methods such as Newton's method. Once \kappa is computed, we can obtain \hat{Y} and Y using Eqs. (3-12) and (3-11), respectively. The iterate of C_{[r]} is then computed by projecting Y back to the original domain, i.e., C_{[r]} = U Y.

Now let us revisit the positivity assumption on \|C_{[r]}\|_F (i.e., on \kappa). When the solution for \kappa in Eq. (3-13) is not positive, there is no solution to Eq. (3-10), as this contradicts the assumption that \kappa > 0. In this case, the optimum occurs at C_{[r]} = 0, because the derivative of \|C_{[r]}\|_F does not exist at C_{[r]} = 0 and our objective function, Eq. (3-8), is convex and bounded from below. The proof of this claim is straightforward: let f(x) be a continuous convex function that is bounded from below and differentiable everywhere except at x = x_o. We solve f'(x) = 0 for the minimum of f(x). If no solution of f'(x) = 0 exists, the minimum of f(x) must occur at x = x_o, for otherwise we would find some x* such that f'(x*) = 0.

As we can see from Eq. (3-13), the block sparsity of C depends on the value of \lambda. The larger \lambda is, the less likely a feasible solution for \kappa in Eq. (3-13) exists. On the other hand, when \lambda = 0 the solution for \kappa is always positive, and hence no C_{[r]} is shrunk to zero. This is analogous to the shrinkage mechanism in the standard Lasso program [7]. When X is a single vector, our BGSC is equivalent to the P_{\ell_q/\ell_1} program in [10]. When there is no block structure on D, BGSC is equivalent to the group sparse coding (GSC) of [1].
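The block-coordinate update just derived can be summarized in a short sketch (our own illustrative NumPy/SciPy code, not the author's implementation; it assumes D_{[r]} has full column rank so that Eq. (3-13) has a root in the chosen bracket):

```python
import numpy as np
from scipy.optimize import brentq

def bgsc_block_update(X, D, C, blocks, r, lam, kappa_max=1e6):
    """One block-coordinate update of C_[r] for the BGSC objective (p = 2).

    Follows Eqs. (3-9)-(3-13): N = D_[r]^T (X - sum_{i != r} D_[i] C_[i]),
    eigendecompose D_[r]^T D_[r] = U diag(sigma) U^T, solve Eq. (3-13) for
    kappa > 0, and recover C_[r] = U Y.  If no positive root exists, the
    optimum is C_[r] = 0.
    """
    idx_r = blocks[r]
    residual = X.copy()
    for i, idx in enumerate(blocks):
        if i != r:
            residual -= D[:, idx] @ C[idx, :]
    Dr = D[:, idx_r]
    N = Dr.T @ residual                      # right-hand side of Eq. (3-10)
    sigma, U = np.linalg.eigh(Dr.T @ Dr)     # eigen-decomposition of D_[r]^T D_[r]
    N_hat = U.T @ N

    # g(kappa) = sum_{i,j} N_hat_{ij}^2 / (kappa * sigma_i + lam)^2 - 1   (Eq. 3-13)
    def g(kappa):
        denom = (kappa * sigma[:, None] + lam) ** 2
        return np.sum(N_hat ** 2 / denom) - 1.0

    if g(0.0) <= 0.0:                        # no positive root: the block is shrunk to zero
        C[idx_r, :] = 0.0
        return C

    kappa = brentq(g, 0.0, kappa_max)        # root of the one-variable equation
    Y_hat = N_hat / (kappa * sigma[:, None] + lam)   # Eq. (3-12)
    C[idx_r, :] = U @ (kappa * Y_hat)        # Y = kappa * Y_hat, C_[r] = U Y
    return C
```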

3.2.3 Reconstructed Block/Group Sparse Coding

For clarity of presentation, we again derive the novel R-BGSC algorithm for one group of data. P'_{\ell_1,p} in Eq. (3-5) can be cast as an optimization problem in terms of C_{[r]} that minimizes

\frac{1}{2}\Big\|X - \sum_{i \neq r} D_{[i]} C_{[i]} - D_{[r]} C_{[r]}\Big\|_F^2 + \lambda \sum_i \|D_{[i]} C_{[i]}\|_p + c,   (3-14)

where c is a constant that includes the terms that do not depend on C_{[r]}. The iterate of C_{[r]} can be derived in a fashion similar to the previous algorithm. We derive the crucial steps of the optimization algorithm for p = 2 (the p = 1 case is straightforward). As in the previous section, we first assume that the norm \|D_{[r]} C_{[r]}\|_F is positive at the optimal solution for C_{[r]}. Taking the gradient of the objective function with respect to C_{[r]} and equating it to zero, we have

-D_{[r]}^\top X + D_{[r]}^\top \sum_{i \neq r} D_{[i]} C_{[i]} + D_{[r]}^\top D_{[r]} C_{[r]} + \lambda D_{[r]}^\top \frac{D_{[r]} C_{[r]}}{\|D_{[r]} C_{[r]}\|_F} = 0.   (3-15)

Moving the first two terms to the right-hand side, denoting N = D_{[r]}^\top (X - \sum_{i \neq r} D_{[i]} C_{[i]}), and computing the singular value decomposition D_{[r]} = U S V^\top, we have

V S^2 V^\top C_{[r]} + \lambda V S \frac{S V^\top C_{[r]}}{\|S V^\top C_{[r]}\|_F} = N.   (3-16)

Multiplying both sides of the above equation by V^\top, and letting Y = \frac{S V^\top C_{[r]}}{\|S V^\top C_{[r]}\|_F}, \kappa = \|S V^\top C_{[r]}\|_F, and \hat{N} = V^\top N, we have

(\kappa S + \lambda S) Y = \hat{N} \;\Longrightarrow\; Y = (\kappa S + \lambda S)^{-1} \hat{N}, \quad \text{s.t.}\ \|Y\|_F = 1.   (3-17)

Using the same method as in Section 3.2.2, \kappa is solved for first and the iterate of C_{[r]} is then computed. Note that when X is a single vector, R-BGSC is equivalent to the P'_{\ell_q/\ell_1} program in [10].
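Analogously to the BGSC sketch above, a minimal illustration of this R-BGSC block update (again our own hypothetical code, assuming D_{[r]} has full column rank) follows Eqs. (3-15)-(3-17):

```python
import numpy as np
from scipy.optimize import brentq

def rbgsc_block_update(X, D, C, blocks, r, lam, kappa_max=1e6):
    """One block-coordinate update of C_[r] for the R-BGSC objective (p = 2).

    With the SVD D_[r] = U S V^T and N = D_[r]^T (X - sum_{i != r} D_[i] C_[i]),
    solve (kappa + lam) S Y = V^T N subject to ||Y||_F = 1 for
    kappa = ||S V^T C_[r]||_F, then recover C_[r].  A zero block is returned
    when no positive kappa exists.
    """
    idx_r = blocks[r]
    residual = X.copy()
    for i, idx in enumerate(blocks):
        if i != r:
            residual -= D[:, idx] @ C[idx, :]
    Dr = D[:, idx_r]
    U, s, Vt = np.linalg.svd(Dr, full_matrices=False)   # D_[r] = U diag(s) V^T
    N_hat = Vt @ (Dr.T @ residual)                       # \hat{N} = V^T N

    # ||Y||_F = 1 with Y = (kappa S + lam S)^{-1} \hat{N}  (Eq. 3-17)
    def g(kappa):
        denom = ((kappa + lam) * s[:, None]) ** 2
        return np.sum(N_hat ** 2 / denom) - 1.0

    if g(0.0) <= 0.0:                                    # no positive kappa: set the block to zero
        C[idx_r, :] = 0.0
        return C

    kappa = brentq(g, 0.0, kappa_max)
    Y = N_hat / ((kappa + lam) * s[:, None])             # Eq. (3-17)
    # S V^T C_[r] = kappa * Y  =>  C_[r] = V diag(1/s) (kappa * Y)
    C[idx_r, :] = Vt.T @ (kappa * Y / s[:, None])
    return C
```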

3.2.4 Intra-Block Coherence Suppression Dictionary Learning

The intra-block coherence is defined as

\mu_S(D) = \max_i \Big( \max_{p,q \in I(i),\, p \neq q} \frac{|d_p^\top d_q|}{\|d_p\|\,\|d_q\|} \Big),   (3-18)

where I(i) is the index set of the atoms in block i. The inter-block coherence \mu_B is defined as

\mu_B(D) = \max_{i \neq j} \frac{1}{n_a}\, \sigma_1\big(D_{[i]}^\top D_{[j]}\big),   (3-19)

where \sigma_1 denotes the largest singular value and n_a is the block size. As mentioned in the Introduction, it is necessary to have a dictionary updating algorithm that minimizes the intra-block coherence. We therefore propose the following objective function:

Q_d(D; X, C) = \sum_g \frac{1}{2}\|X^{(g)} - D C^{(g)}\|_F^2 + \gamma \sum_k \|d_k\|_2 + \beta \sum_b \sum_{p,q \in I(b),\, p \neq q} (d_p^\top d_q)^2 + \lambda\, \Omega(C),   (3-20)

where \Omega is the regularizer term on C (see Eqs. (3-14) and (3-8)). The first two terms are the same as the objective function used in [1]; their formulation facilitates the removal of dictionary atoms with low predictive power, provided the data are mean-subtracted. We add the third term to minimize the intra-block coherence.

For the sake of clarity, we derive the update formula required for optimizing the objective function above for one group of data. Again, we first assume that the optimal solution for d_r has a non-zero norm. Computing the gradient with respect to d_r and equating it to zero, we have

-X c_r^\top + \sum_{k \neq r} d_k (c_k c_r^\top) + d_r (c_r c_r^\top) + \gamma \frac{d_r}{\|d_r\|_2} + \beta \sum_{j \in I(b),\, j \neq r} d_j d_j^\top d_r = 0,   (3-21)

where c_r is the r-th row of C and d_r belongs to block b. Note that c_r c_r^\top indicates how heavily the atom d_r is being used to encode X. It is clear from the first three terms of the above equation why group-regularized SC algorithms tend to generate blocks with high intra-block coherence.

As we can see, the value of d_r depends not only on how much it is being used to encode X (the 1st and 3rd terms) but also on how much the other d_k's are being used to encode X. Since BGSC and R-BGSC minimize the number of blocks used for encoding X, the atoms d_r and d_k are likely in the same block. For example, if the coefficient matrix C of X has only one non-zero block, then the atoms d_r and d_k, which correspond to the non-zero rows of coefficients c in the above equation, are all in the same block. Therefore, updating d_r using only the first three terms of the above equation will result in high intra-block coherence. This justifies adding the intra-block coherence suppressing regularizer term in Eq. (3-20).

To the best of our knowledge, there is no work discussing how to group the training samples. Intuitively, one would split a class of training data into multiple similar groups using techniques such as K-means. However, this might put all the in-class outliers, e.g., badly written 9's that look like a 7, into one group, and hence allow them to act as a different class and be used to train the dictionary blocks corresponding to the wrong classes. From our empirical observations, it is better to have groups of data with variability similar to that of the whole class. This forces the in-class outliers to be regularized by the inliers of the same class.

Continuing from Eq. (3-21), moving the first two terms to the right-hand side and denoting \nu_i = (X - \sum_{k \neq r} d_k c_k)\, c_r^\top, replacing c_r c_r^\top with t, and \sum_{j \in I(b), j \neq r} d_j d_j^\top with \Phi_r, Eq. (3-21) becomes

t\, d_r + \gamma \frac{d_r}{\|d_r\|_2} + \beta \Phi_r d_r = \nu_i \;\Longrightarrow\; t\, U^\top d_r + \gamma \frac{U^\top d_r}{\|d_r\|_2} + \beta \Sigma_\Phi U^\top d_r = U^\top \nu_i,   (3-22)

where U \Sigma_\Phi U^\top is the eigen-decomposition of \Phi_r, \Sigma_\Phi is a diagonal matrix containing only the non-zero eigenvalues of \Phi_r, and the columns of U are the corresponding eigenvectors. Denoting U^\top d_r / \|d_r\|_2 by y, \|d_r\|_2 by \kappa, and U^\top \nu_i by \hat{\nu}_i, Eq. (3-22) becomes

\kappa t\, y + \gamma y + \kappa \beta \Sigma_\Phi y = \hat{\nu}_i \;\Longrightarrow\; y = \big((\kappa t + \gamma) I + \kappa \beta \Sigma_\Phi\big)^{-1} \hat{\nu}_i, \quad \text{s.t.}\ \|y\|_2 = 1.   (3-23)

We can use the same methods as in the previous sections to solve for the iterate of d_r; if the solution for \kappa is 0, we set d_r = 0. Note that it is not uncommon to add a post-processing step that normalizes the atoms of D to unit norm, or simply to require \|d_r\|_2 = 1. This changes the iterate of d_r to d_r = (t I + \beta \Phi_r)^{-1} \nu_i (since \|d_r\|_2 = 1), and it makes the whole algorithm much more efficient, as computing the eigen-decomposition of the typically large matrix \Phi_r can now be avoided.
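A minimal sketch of the simplified atom update just described (our own illustrative code; it assumes the unit-norm convention so that the update reduces to d_r = (tI + \beta\Phi_r)^{-1}\nu_i followed by re-normalization) is given below.

```python
import numpy as np

def icsdl_atom_update(X, D, C, blocks, r, beta):
    """Simplified ICS-DL update of atom d_r under the unit-norm convention.

    Implements d_r = (t I + beta * Phi_r)^{-1} nu, where
      nu    = (X - sum_{k != r} d_k c_k) c_r^T,
      t     = c_r c_r^T, and
      Phi_r = sum_{j in same block, j != r} d_j d_j^T,
    followed by re-normalization to unit length.  Assumes the dictionary
    atoms are kept at unit norm and that c_r is not identically zero (so
    the linear system is nonsingular).
    """
    m = D.shape[0]
    c_r = C[r, :]                                  # r-th row of C
    t = float(c_r @ c_r)
    residual = X - D @ C + np.outer(D[:, r], c_r)  # X - sum_{k != r} d_k c_k
    nu = residual @ c_r                            # shape (m,)

    block = next(idx for idx in blocks if r in idx)
    others = [j for j in block if j != r]
    Phi_r = D[:, others] @ D[:, others].T if others else np.zeros((m, m))

    d_r = np.linalg.solve(t * np.eye(m) + beta * Phi_r, nu)
    norm = np.linalg.norm(d_r)
    if norm > 0:
        D[:, r] = d_r / norm                       # re-normalize the updated atom
    return D
```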

3.3 Experiments and Discussions

3.3.1 Hand-Written Digit Recognition

In this experiment we used the USPS dataset [17], which contains a total of 9,298 16-by-16 images of hand-written digits. We vectorized the images and normalized the vectors to have unit \ell_2 norm. We collected 15 groups of data from each digit, where each group contained 50 randomly chosen images from the same class. The experiment was conducted as follows (a minimal sketch of this pipeline is given after the list):

1. Generate a random dictionary D with n_b blocks, where each block contains n_a columns of atoms (a total of n_b x n_a columns).
2. Iteratively compute the coefficients using BGSC and update the dictionary using the ICS-DL algorithm.
3. Use the coefficients of the training data to train 10 one-vs-all linear SVMs [3].
4. Compute the sparse coefficients of the test samples using either BGSC or R-BGSC, and use the SVMs to classify the test samples based on these coefficients.
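The following sketch outlines this pipeline (hypothetical code: bgsc_encode and icsdl_update stand for implementations of the sparse coding and dictionary updates of Sec. 3.2 and are not defined here; the SVM step uses scikit-learn's LinearSVC, which trains one-vs-rest linear classifiers).

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_and_classify(groups, labels, test_samples, bgsc_encode, icsdl_update,
                       n_blocks=20, n_atoms=25, n_iter=200, lam_train=0.6, lam_test=0.2):
    """Schematic of the USPS experiment: alternate BGSC coding and ICS-DL updates,
    train one-vs-all linear SVMs on the training codes, then classify test codes.

    `groups` is a list of (dim x group_size) matrices, `labels` the digit label of
    each group; `bgsc_encode` and `icsdl_update` are assumed helpers implementing
    the sparse coding and dictionary updates sketched in Sec. 3.2.
    """
    dim = groups[0].shape[0]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((dim, n_blocks * n_atoms))          # step 1: random dictionary
    D /= np.linalg.norm(D, axis=0)
    blocks = [list(range(b * n_atoms, (b + 1) * n_atoms)) for b in range(n_blocks)]

    for _ in range(n_iter):                                      # step 2: alternate coding / update
        codes = [bgsc_encode(X_g, D, blocks, lam_train) for X_g in groups]
        D = icsdl_update(groups, codes, D, blocks)

    train_C = np.hstack(codes)                                   # step 3: one-vs-all linear SVMs
    train_y = np.concatenate([[y] * g.shape[1] for y, g in zip(labels, groups)])
    svm = LinearSVC().fit(train_C.T, train_y)

    test_C = bgsc_encode(test_samples, D, blocks, lam_test)      # step 4: code and classify tests
    return svm.predict(test_C.T)
```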

Table 3-1 demonstrates the impact of the dictionary's block structure on the error rates. The parameters are \beta = 200, \lambda_train = 0.6 (\lambda_train varies slightly with respect to n_a), and \lambda_test = 0.2. For the experiment in the last column of Table 3-1, we assign two blocks to each digit. The results show that the error rates are similar when the number of blocks (n_b) is greater than 10, even though the number of classes in this dataset is 10. The reason is that there exists some variability within each class and mutual similarity between images of different classes. In fact, as shown in Fig. 3-2, the sparse coefficients of most of the training data have 3 to 6 active (non-zero) blocks when n_b = 20. The last column of Table 3-1 shows that the hard assignment of blocks to classes results in a higher error rate, even though the dictionary is twice as large as those of the first three experiments in Table 3-1.

Figure 3-2. Visualization of the absolute values of the sparse coefficients of the training samples. Each column contains 15 groups. Gray areas correspond to non-zero coefficients; darker color represents larger values.

As mentioned in the Introduction, we did not assign blocks to classes; the rationale is that we want to let data with larger variability use more blocks for encoding. Moreover, we allow data from different classes to share blocks. Fig. 3-2 illustrates the coefficients of the training data. We can see that the digits 7 and 9 share two blocks of the dictionary due to their similarity. However, they each have an exclusive block with large coefficients (darker in color) that allows them to encode the difference.

Next we demonstrate the effect of the value of \beta in ICS-DL on classification rates. The parameters are \lambda_train = 0.4 and (n_b, n_a) = (20, 25). When \beta = 0, our ICS-DL algorithm does not suppress intra-block coherence and is hence equivalent to the dictionary learning algorithm in [1]. We used BGSC to compute the coefficients during training.

During testing, we used either BGSC or R-BGSC to compute the coefficients of the test samples. \lambda_test was varied between 0.15 and 0.35, and the best result is reported in Table 3-2. We stopped the training after roughly 200 iterations, when the dictionary updates no longer changed much.

Table 3-1. Classification error (%) with different structures on D. n_b: number of blocks in D; n_a: number of atoms in each block. Columns: (n_b, n_a) = (20,25), (40,12), (10,50), (20,50), and (20,50)*, where * indicates that each digit is assigned to two blocks of the dictionary.

Figure 3-3. Intra-block coherence (solid) and error rates (dotted), plotted against training iterations, for two dictionaries (red for \beta = 200 and blue for \beta = 0). Error rates of the first 30 iterations are not shown.

The results in Table 3-2 suggest that suppressing the intra-block coherence can indeed improve the performance. However, as \beta increases, the error rate increases. In the extreme case, when strict orthogonality is imposed on the blocks using UOB-DL, the error rate increased to 4.27% (see Table 3-3). These results provide empirical support for our claim of not using a strict orthogonality constraint. Note that when \beta = 0, our result is very close to that of SISF-DL [33] (see Table 3-3). However, our ICS-DL algorithm does not impose any inter-block orthogonality constraint on the dictionary as SISF-DL does.

This is probably because our framework does not hard-assign classes to blocks of the dictionary and because we impose a group structure on the data.

Figure 3-4. Error rates (%) of the USPS dataset, plotted against \lambda_test, under five different scenarios: 1: G|BG|BG, 2: G|BG|R-BG, 3: I|BG|BG, 4: I|BG|R-BG, 5: I|R-BG|R-BG. The scenarios differ in how the training samples are organized to compute the coefficients and which of the proposed SC algorithms is used. The first field in the legend (separated by |) indicates whether the coefficients of the training samples are computed in groups (G) or individually (I); the second field indicates which SC algorithm is used to compute the coefficients of the training samples; the third field indicates which SC algorithm is used to compute the coefficients of the test samples individually.

Table 3-2. Classification error (%) with different values of \beta in ICS-DL.

To further demonstrate the intra-block coherence suppressing property of our ICS-DL algorithm, we plot the intra-block coherence values of the dictionaries trained with \beta = 0 and \beta = 200, respectively, in Fig. 3-3. We also provide the error rates every 4 iterations from the 30-th iteration onward. Solid and dotted lines indicate the coherence and error, respectively. The red solid line demonstrates that our ICS-DL method keeps the intra-block coherence at a low value. On the contrary, without the intra-block coherence suppression term, the blue solid line shows that the coherence value becomes comparably large as the number of iterations increases.

The blue dotted line shows that its associated error rate even increases between iterations 40 and 60, which implies that over-fitting occurs within some blocks.

Once we have a trained dictionary, we use the coefficients of the training samples to train 10 linear SVMs. We can use the coefficients already computed during the training phase; these coefficients are computed as a group. Another way to obtain the coefficients of the training samples is to re-compute them individually. Fig. 3-4 shows the error rates of five different scenarios. The dictionary was trained with \beta = 400, \lambda_train = 0.4, (n_b, n_a) = (20, 25), and 150 iterations. The results in Fig. 3-4 show that R-BGSC generally performed slightly better than BGSC, especially in scenarios 1 and 2 of Fig. 3-4. However, the result from scenario 3 with \lambda_test = 0.25 achieves the best error rate at 2.26% (0.02% better than that of scenario 4 with \lambda_test = 0.30).

Finally, we compare our results with other state-of-the-art results obtained using dictionary learning algorithms ([28, 33]) in the top row of Table 3-3. We also compare with UOB-DL ([23]), which imposes a strict orthogonality constraint on the blocks. The results show that our algorithms outperform the other dictionary learning methods, even the one specially tailored for hand-written digit recognition [16]. Although Table 3-2 suggests that suppressing the intra-block coherence of D improves the classification performance, imposing strict orthogonality on the blocks does not result in any improvement.

Table 3-3. Comparison of classification error rates (%) on the USPS and MNIST datasets with recently published approaches (BGSC, R-BGSC, SISF-DL, SDL-DL, UOB-DL, TDK-SVM). The results of SISF-DL, SDL-DL, and TDK-SVM are taken from [33], [28], and [16], respectively.

We also applied our framework to the MNIST dataset. However, due to the size and complexity of this dataset, we were not able to fully explore different dictionary structures and parameters to obtain a reasonable result. The parameters used to obtain the results in Table 3-3 are (n_b, n_a) = (40, 40) (our dictionary is 5 times smaller than the one used in SISF-DL), \beta = 500, and \lambda_train = 1.20; groups containing 100 samples each were used.

3.3.2 Group Regularized Face Classification

Image classification using the SRC framework is known to be sensitive to variations, such as misalignment or differing lighting conditions, between test and training samples, and a small amount of variation can often negatively affect the performance. One straightforward solution would be to include all possible variations in the dictionary ([22]); however, this would drastically increase the computational cost. Therefore, applying perturbations to the test image on-line offers a more practical solution. The perturbations can be transformations that compensate for spatial misalignment, or, if the normal vectors of the test face image are provided or can be computed ([12]), different lighting conditions. We propose an SRC framework here for face recognition that can alleviate the negative effect of variations between the training and test samples.

In this experiment, we used the cropped Extended Yale Face Database B [21], which contains images of 38 persons under different illumination conditions. To simulate large variations between training and test samples, we used the images whose light-source azimuth and elevation angles are no greater than 35 degrees as the training samples. As shown in the top row of Fig. 3-5A, these images are well-lit and contain little to no shadow; we use these samples to simulate well-prepared, laboratory-grade data. We kept the rest of the dataset, which contains a larger amount of shadow, as the test samples (bottom row of Fig. 3-5A). We use the test samples to simulate poorly acquired data or poor perturbations of a single test sample. For example, suppose we have one or a few poorly acquired test images that differ greatly from the training samples in D.

We estimate the normal of each pixel in the image and generate many illumination conditions of this image that are closer to those in D. However, due to the poor estimation of the normals, not all perturbations have good quality. In this experiment, we want to demonstrate that combining these perturbations as a group can improve the classification performance. The experiment was conducted as follows:

1. Project the samples down to R^m using PCA.
2. Use the training samples as the atoms of D. D has n_b = 38 blocks, where each block contains n_a = 24 or 23 atoms (some categories have missing samples).
3. Randomly pick n_g test samples from one class to form a group X.
4. Compute the coefficients C of X using BGSC, R-BGSC, and GSC ([1]).
5. Compute the class label of each column of X individually using label(x) = arg min_i \|x - D_{[i]} c_{[i]}\|_2^2, where c is the sparse coefficient vector of x (see the sketch below).

Other than the three methods above, we also used (a) the framework in [46] (WSC), and (b) two algorithms, P_{\ell_2/\ell_1} (BSC) and P'_{\ell_2/\ell_1} (B'SC), from [10]. Since these methods do not impose a group structure on the data, we computed the coefficients of the test samples individually.

The rationale behind using the group structure is that as long as parts of some images (e.g., the 1st, 3rd, and 4th images in the bottom row of Fig. 3-5A) are similar to a few atoms/blocks in D, they will generate high responses with respect to these atoms. This activates a set containing these atoms and, in turn, forces the other images in X to use these atoms for encoding. Therefore, it reduces the chance that other, severely shadowed images are encoded using irrelevant blocks.
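The labeling rule in step 5 can be written compactly as follows (our own illustrative NumPy sketch, not the author's implementation):

```python
import numpy as np

def classify_by_block_residual(X, D, C, blocks):
    """Assign each test sample (column of X) to the block with the smallest
    reconstruction residual: label(x) = argmin_i ||x - D_[i] c_[i]||_2^2,
    where c is the sparse coefficient vector of x (step 5 above).
    """
    labels = np.empty(X.shape[1], dtype=int)
    for col in range(X.shape[1]):
        x, c = X[:, col], C[:, col]
        residuals = [np.linalg.norm(x - D[:, idx] @ c[idx]) ** 2 for idx in blocks]
        labels[col] = int(np.argmin(residuals))
    return labels
```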

The classification rates of the six methods are shown in Fig. 3-5B. The results indicate that the group-regularized methods are generally better than the others; our novel R-BGSC significantly outperformed the others by roughly 20%. Note that the classification rates of BSC and WSC in Fig. 3-5B are much worse than those reported in [10] and [46], respectively. This is because, in their experiments, they randomly chose half of the dataset for training (used as the dictionary) and the other half as test samples. Therefore, their dictionaries are almost two times larger than ours, and the variability between training and test samples is minimized by the random selection.

Figure 3-5. A: Top row: well-lit faces in the training set. Bottom row: test samples containing large amounts of shadow. B: Classification rates (%) of the 6 methods (BGSC, R-BGSC, GSC, BSC, B'SC, WSC) for different group sizes (n_g) with m = 600. The \lambda values for BGSC, R-BGSC, GSC, BSC, B'SC, and WSC are 0.2, 0.2, 0.05, 0.1, 0.1, and 0.02, respectively.


Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

Infinitely Repeated Games with Discounting Ù

Infinitely Repeated Games with Discounting Ù Infinitely Repeated Games with Discounting Page 1 Infinitely Repeated Games with Discounting Ù Introduction 1 Discounting the future 2 Interpreting the discount factor 3 The average discounted payoff 4

More information