Universal Regularizers For Robust Sparse Coding and Modeling


Universal Regularizers For Robust Sparse Coding and Modeling

Ignacio Ramírez and Guillermo Sapiro
Department of Electrical and Computer Engineering, University of Minnesota

arXiv: v2 [cs.IT] 3 Aug 2010

Abstract

Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many signal and image processing tasks. It is now well understood that the choice of the sparsity regularization term is critical in the success of such models. Based on a codelength minimization interpretation of sparse coding, and using tools from universal coding theory, we propose a framework for designing sparsity regularization terms which have theoretical and practical advantages when compared to the more standard ℓ0 or ℓ1 ones. The presentation of the framework and theoretical foundations is complemented with examples that show its practical advantages in image denoising, zooming and classification.

I. INTRODUCTION

Sparse modeling calls for constructing a succinct representation of some data as a combination of a few typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of learning such collections of atoms (usually called dictionaries or codebooks), e.g., [1], [14], [33], and of representing the actual data in terms of them, e.g., [8], [11], [12], have been developed in recent years, leading to state-of-the-art results in many signal and image processing tasks [24], [26], [27], [34]. We refer the reader for example to [4] for a recent review on the subject. A critical component of sparse modeling is the actual sparsity of the representation, which is controlled by a regularization term (regularizer for short) and its associated parameters. The choice of the functional form of the regularizer and its parameters is a challenging task.
Several solutions to this problem have been proposed in the literature, ranging from the automatic tuning of the parameters [2] to Bayesian models, where these parameters are themselves considered as random variables [17], [2], [51]. In this work we adopt the interpretation of sparse coding as a codelength minimization problem. This is a natural and objective method for assessing the quality of a statistical model for describing given data, and it is based on the Minimum Description Length (MDL) principle [37]. In this framework, the regularization term in the sparse coding formulation is interpreted as the cost in bits of describing the sparse linear coefficients used to reconstruct the data. Several works on image coding using this approach were developed in the 1990s under the name of complexity-based or compression-based coding, following the popularization of MDL as a powerful statistical modeling tool [9], [31], [4]. The focus of these early works was on denoising using wavelet bases, using either generic asymptotic results from MDL or fixed probability models, in order to compute the description length of the coefficients. A later, major breakthrough in MDL theory was the adoption of universal coding tools to compute optimal codelengths. In this work, we improve and extend on previous results in this line of work by designing regularization terms based on such universal codes for image coefficients, meaning that the codelengths obtained when encoding the coefficients of any (natural) image with such codes will be close to the shortest codelengths that can be obtained with any model fitted specifically for that particular instance of coefficients. The resulting framework not only formalizes sparse coding from the MDL and universal coding perspectives but also leads to a family of universal regularizers which we show to consistently improve results in image processing tasks such as denoising and classification.
These models also enjoy several desirable theoretical and practical properties, such as statistical consistency (in certain cases), improved robustness to outliers in the data, and improved sparse signal recovery (e.g., decoding of sparse signals from a compressive sensing point of view [5]) when compared with the traditional ℓ0 and ℓ1-based techniques in practice. These models also lead to a simple and efficient optimization technique for solving the corresponding sparse coding

problems as a series of weighted ℓ1 subproblems, which in turn can be solved with off-the-shelf algorithms such as LARS [12] or IST [11]. Details are given in the sequel. Finally, we apply our universal regularizers not only for coding using fixed dictionaries, but also for learning the dictionaries themselves, leading to further improvements in all the aforementioned tasks.

The remainder of this paper is organized as follows: in Section II we introduce the standard framework of sparse modeling. Section III is dedicated to the derivation of our proposed universal sparse modeling framework, while Section IV deals with its implementation. Section V presents experimental results showing the practical benefits of the proposed framework in image denoising, zooming and classification tasks. Concluding remarks are given in Section VI.

II. SPARSE MODELING AND THE NEED FOR BETTER MODELS

Let X ∈ R^(M×N) be a set of N column data samples x_j ∈ R^M, D ∈ R^(M×K) a dictionary of K column atoms d_k ∈ R^M, and A ∈ R^(K×N), a_j ∈ R^K, a set of reconstruction coefficients such that X = DA. We use a^T_k to denote the k-th row of A, the coefficients associated to the k-th atom in D. For each j = 1,...,N we define the active set of a_j as A_j = {k : a_kj ≠ 0, 1 ≤ k ≤ K}, and ‖a_j‖₀ = |A_j| as its cardinality. The goal of sparse modeling is to design a dictionary D such that for all or most data samples x_j there exists a coefficients vector a_j such that x_j ≈ Da_j and ‖a_j‖₀ is small (usually below some threshold L ≪ K). Formally, we would like to solve the following problem

  min_{D,A} Σ_{j=1}^N ψ(a_j)  s.t.  ‖x_j − Da_j‖²₂ ≤ ε, j = 1,...,N,   (1)

where ψ(·) is a regularization term which induces sparsity in the columns of the solution A. Usually the constraint ‖d_k‖₂ ≤ 1, k = 1,...,K, is added, since otherwise we can always decrease the cost function arbitrarily by multiplying D by a large constant and dividing A by the same constant. When D is fixed, the problem of finding a sparse a_j for each sample x_j is called sparse coding,

  a_j = arg min_a ψ(a)  s.t.  ‖x_j − Da‖²₂ ≤ ε.   (2)

Among possible choices of ψ(·) are the ℓ0 pseudo-norm, ψ(a) = ‖a‖₀, and the ℓ1 norm, ψ(a) = ‖a‖₁.
The former tries to solve directly for the sparsest a_j, but since it is non-convex, it is commonly replaced by the ℓ1 norm, which is its closest convex approximation. Furthermore, under certain conditions on (a fixed) D and the sparsity of a_j, the solutions to the ℓ0 and ℓ1-based sparse coding problems coincide (see for example [5]). The problem (1) is also usually formulated in Lagrangian form,

  min_{D,A} Σ_{j=1}^N ‖x_j − Da_j‖²₂ + λψ(a_j),   (3)

along with its respective sparse coding problem when D is fixed,

  a_j = arg min_a ‖x_j − Da‖²₂ + λψ(a).   (4)

Even when the regularizer ψ(·) is convex, the sparse modeling problem, in any of its forms, is jointly non-convex in (D, A). Therefore, the standard approach to find an approximate solution is to use alternate minimization: starting with an initial dictionary D⁽⁰⁾, we minimize (3) alternately in A via (2) or (4) (sparse coding step), and in D (dictionary update step). The sparse coding step can be solved efficiently when ψ(·) = ‖·‖₁ using for example IST [11] or LARS [12], or with OMP [28] when ψ(·) = ‖·‖₀. The dictionary update step can be done using for example MOD [14] or K-SVD [1].

A. Interpretations of the sparse coding problem

We now turn our attention to the sparse coding problem: given a fixed dictionary D, for each sample vector x_j, compute the sparsest vector of coefficients a_j that yields a good approximation of x_j. The sparse coding problem admits several interpretations. What follows is a summary of these interpretations and the insights that they provide into the properties of the sparse models that are relevant to our derivation.
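As a concrete illustration of the sparse coding step, the following minimal sketch (our own illustration, not the authors' implementation; the dictionary, sparsity pattern and λ are synthetic) solves the Lagrangian ℓ1 problem (4) for a fixed D by iterative soft thresholding (IST):

```python
import numpy as np

def ista(x, D, lam, n_iter=500):
    """Minimize ||x - D a||_2^2 + lam * ||a||_1 by iterative soft thresholding."""
    L = np.linalg.norm(D, 2) ** 2                 # largest eigenvalue of D^T D
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - D.T @ (D @ a - x) / L             # gradient step on the data term
        a = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms, as required in (1)
a_true = np.zeros(128)
a_true[[3, 40, 90]] = [1.5, -2.0, 1.0]            # a 3-sparse synthetic coefficient vector
x = D @ a_true                                    # noiseless synthetic sample
a_hat = ista(x, D, lam=0.05)
```

The step size 1/(2L) and the per-iteration threshold λ/(2L), with L the squared spectral norm of D, are the standard IST choices; on this synthetic instance the three active atoms are recovered.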

1) Model selection in statistics: Using the ℓ0 norm as ψ(·) in (4) is known in the statistics community as Akaike's Information Criterion (AIC) when λ = 1, or the Bayes Information Criterion (BIC) when λ = (1/2) log M, two popular forms of model selection (see [22, Chapter 7]). In this context, the ℓ1 regularizer was introduced in [43], again as a convex approximation of the above model selection methods, and is commonly known (either in its constrained or Lagrangian forms) as the Lasso. Note however that, in the regression interpretation of (4), the roles of D and X are very different.

2) Maximum a posteriori: Another interpretation of (4) is that of a maximum a posteriori (MAP) estimation of a_j in the logarithmic scale, that is

  a_j = arg max_a {log P(a|x_j)} = arg max_a {log P(x_j|a) + log P(a)} = arg min_a {−log P(x_j|a) − log P(a)},   (5)

where the observed samples x_j are assumed to be contaminated with additive, zero mean, IID Gaussian noise with variance σ², P(x_j|a) ∝ e^{−(1/(2σ²))‖x_j − Da‖²₂}, and a prior probability model on a of the form P(a) ∝ e^{−θψ(a)} is considered. The energy term in Equation (4) follows by plugging the previous two probability models into (5) and factorizing 2σ² into λ = 2σ²θ. According to (5), the ℓ1 regularizer corresponds to an IID Laplacian prior with mean 0 and inverse-scale parameter θ, P(a) = Π_{k=1}^K θe^{−θ|a_k|} = θ^K e^{−θ‖a‖₁}, which has a special meaning in signal processing tasks such as image or audio compression. This is due to the widely accepted fact that representation coefficients derived from predictive coding of continuous-valued signals, and, more generally, responses from zero-mean filters, are well modeled using Laplacian distributions. For example, for the special case of DCT coefficients of image patches, an analytical study of this phenomenon is provided in [25], along with further references on the subject.

3) Codelength minimization: Sparse coding, in all its forms, has yet another important interpretation.
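The MAP reading of (4)-(5) can be checked numerically in the simplest setting: when D is orthonormal, the ℓ1 problem decouples coordinate-wise and its exact solution is soft thresholding of Dᵀx at λ/2, with λ = 2σ²θ. A small sketch (synthetic data; the values of σ and θ are illustrative, not from the paper):

```python
import numpy as np

def soft(u, t):
    """Coordinate-wise soft thresholding, the proximal operator of t*||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthonormal dictionary
a_true = np.zeros(16)
a_true[[2, 7]] = [3.0, -2.0]
sigma, theta = 0.1, 30.0
x = D @ a_true + sigma * rng.standard_normal(16)    # Gaussian noise model of (5)
lam = 2 * sigma ** 2 * theta                        # lambda = 2*sigma^2*theta
a_map = soft(D.T @ x, lam / 2)                      # exact MAP solution for orthonormal D
```

The recovered a_map keeps the two large coefficients (shrunk by λ/2) and zeroes essentially all noise-only coordinates, which is the ℓ1/Laplacian MAP behavior described above.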
Suppose that we have a fixed dictionary D and that we want to use it to compress an image, either losslessly by encoding the reconstruction coefficients A and the residual X − DA, or in a lossy manner, by obtaining a good approximation X ≈ DA and encoding only A. Consider for example the latter case. Most modern compression schemes consist of two parts: a probability assignment stage, where the data, in this case A, is assigned a probability P(A), and an encoding stage, where a code C(A) of length L(A) bits is assigned to the data given its probability, so that L(A) is as short as possible. The techniques known as Arithmetic and Huffman coding provide the best possible solution for the encoding step, which is to approximate the Shannon ideal codelength L(A) = −log P(A) [1, Chapter 5]. Therefore, modern compression theory deals with finding the coefficients A that maximize P(A), or, equivalently, that minimize −log P(A). Now, to encode X lossily, we obtain coefficients A such that each data sample x_j is approximated up to a certain ℓ2 distortion ε, ‖x_j − Da_j‖²₂ ≤ ε. Therefore, given a model P(a) for a vector of reconstruction coefficients, and assuming that we encode each sample independently, the optimum vector of coefficients a_j for each sample x_j will be the solution to the optimization problem

  a_j = arg min_a −log P(a)  s.t.  ‖x_j − Da‖²₂ ≤ ε,   (6)

which, for the choice P(a) ∝ e^{−ψ(a)}, coincides with the error constrained sparse coding problem (2). Suppose now that we want lossless compression. In this case we also need to encode the reconstruction residual x_j − Da_j. Since P(x, a) = P(x|a)P(a), the combined codelength will be

  L(x_j, a_j) = −log P(x_j, a_j) = −log P(x_j|a_j) − log P(a_j).   (7)

Therefore, obtaining the best coefficients a_j amounts to solving min_{a_j} L(x_j, a_j), which is precisely the MAP formulation of (5), which in turn, for proper choices of P(x|a) and P(a), leads to the Lagrangian form of sparse coding (4).¹

¹ Laplacian models, as well as Gaussian models, are probability distributions over R, characterized by continuous probability density functions, f(a) = F′(a), F(a) = P(x ≤ a).
If the reconstruction coefficients are considered real numbers, under any of these distributions, any instance of A ∈ R^(K×N) will have measure 0, that is, P(A) = 0. In order to use such distributions as our models for the data, we assume that the coefficients in A are quantized to a precision Δ, small enough for the density function f(a) to be approximately constant in any interval [a − Δ/2, a + Δ/2], a ∈ R, so that we can approximate P(a) ≈ Δf(a), a ∈ R. Under these assumptions, −log P(a) ≈ −log f(a) − log Δ, and the effect of Δ on the codelength produced by any model is the same. Therefore, we will omit Δ in the sequel, and treat density functions and probability distributions interchangeably as P(·). Of course, in real compression applications, Δ needs to be tuned.
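The role of the precision Δ can be verified directly: refining Δ changes every model's codelength by the same −log Δ per coefficient, so Δ cancels when comparing models. A short sketch under a Laplacian model (the value of θ and the sample size are illustrative):

```python
import numpy as np

def codelength_bits(a, theta, delta):
    """Codelength (in bits) of coefficients quantized to precision delta under a
    Laplacian model with inverse-scale theta: sum of -log2(f(a) * delta)."""
    f = 0.5 * theta * np.exp(-theta * np.abs(a))  # Laplacian density f(a)
    return float(np.sum(-np.log2(f * delta)))

rng = np.random.default_rng(0)
a = rng.laplace(scale=0.5, size=1000)             # 1000 samples with theta = 2
L_coarse = codelength_bits(a, theta=2.0, delta=1e-2)
L_fine = codelength_bits(a, theta=2.0, delta=1e-3)
extra = L_fine - L_coarse                         # exactly log2(10) bits per coefficient
```

Since the added cost (here 1000·log2(10) bits) is the same for any density f, it plays no role in the minimization and Δ can indeed be omitted.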

Fig. 1. Standard 8×8 DCT dictionary (a), global empirical distribution of the coefficients in A (b, log scale), empirical distributions of the coefficients associated to each of the K = 64 DCT atoms (c, log scale). The distributions in (c) have a similar heavy tailed shape (heavier than Laplacian), but the variance in each case can be significantly different. (d) Histogram of the K = 64 different θ̂_k values obtained by fitting a Laplacian distribution to each row a^T_k of A. Note that there are significant occurrences between θ̂ = 5 to θ̂ = 25. The coefficients A used in (b-d) were obtained from encoding 8×8 patches (after removing their DC component) randomly sampled from the Pascal 2006 dataset of natural images [15]. (e) Histograms showing the spatial variability of the best local estimations of θ̂_k for a few rows of A across different regions of an image. In this case, the coefficients A correspond to the sparse encoding of all 8×8 patches from a single image, in scan-line order. For each k, each value of θ̂_k was computed from a random contiguous block of 250 samples from a^T_k. The procedure was repeated 400 times to obtain an empirical distribution. The wide supports of the empirical distributions indicate that the estimated θ̂ can have very different values, even for the same atom, depending on the region of the data from where the coefficients are taken.

As one can see, the codelength interpretation of sparse coding is able to unify and interpret both the constrained and unconstrained formulations in one consistent framework. Furthermore, this framework offers a natural and objective measure for comparing the quality of different models P(x|a) and P(a) in terms of the codelengths obtained.

4) Remarks on related work: As mentioned in the introduction, the codelength interpretation of signal coding was already studied in the context of orthogonal wavelet-based denoising. An early example of this line of work considers a regularization term which uses the Shannon Entropy function −Σ_i p_i log p_i to give a measure of the sparsity of the solution [9].
However, the Entropy function is not used as a measure of the ideal codelength for describing the coefficients, but as a measure of the sparsity (actually, group sparsity) of the solution. The MDL principle was applied to the signal estimation problem in [4]. In this case, the codelength term includes the description of both the location and the magnitude of the nonzero coefficients. Although a pioneering effort, the model assumed in [4] for the coefficient magnitude is a uniform distribution on [0, 1], which does not exploit a priori knowledge of image coefficient statistics, and the description of the support is slightly wasteful. Furthermore, the codelength expression used is an asymptotic result, actually equivalent to BIC (see Section II-A1), which can be misleading when working with small sample sizes, such as when encoding small image patches, as in current state of the art image processing applications. The uniform distribution was later replaced by the universal code for integers [38] in [31]. However, as in [4], the model is so general that it does not perform well for the specific case of coefficients arising from image decompositions, leading to poor results. In contrast, our models are derived following a careful analysis of image coefficient statistics. Finally, probability models suitable to image coefficient statistics, of the form P(a) ∝ e^{−(|a|/β)^ρ} (known as generalized Gaussians), were applied to the MDL-based signal coding and estimation framework in [31]. The justification for such models is based on the empirical observation that sparse coefficients statistics exhibit heavy tails (see next section). However, the choice is ad hoc and no optimality criterion is available to compare it with other possibilities. Moreover, there is no closed form solution for performing parameter estimation on such a family of models, requiring numerical optimization techniques. In Section III, we derive a number of probability models for which parameter estimation can be computed efficiently in closed form, and which are guaranteed to optimally describe image coefficients.
B. The need for a better model

As explained in the previous subsection, the use of the ℓ1 regularizer implies that all the coefficients in A share the same Laplacian parameter θ. However, as noted in [25] and references therein, the empirical variance of coefficients associated to different atoms, that is, of the different rows a^T_k of A, varies greatly with k = 1,...,K. This is clearly seen in Figures 1(a-c), which show the empirical distribution of DCT coefficients of 8×8 patches.

As the variance of a Laplacian is 2/θ², different variances indicate different underlying θ. The histogram of the set {θ̂_k, k = 1,...,K} of estimated Laplacian parameters for each row k, Figure 1(d), shows that this is indeed the case, with significant occurrences of values of θ̂ in a range of 5 to 25. The straightforward modification suggested by this phenomenon is to use a model where each row of A has its own weight associated to it, leading to a weighted ℓ1 regularizer. However, from a modeling perspective, this results in K parameters to be adjusted instead of just one, which often results in poor generalization properties. For example, in the cases studied in Section V, even with thousands of images for learning these parameters, the results of applying the learned model to new images were always significantly worse (over 1dB in estimation problems) when compared to those obtained using simpler models such as an unweighted ℓ1.² One reason for this failure may be that real images, as well as other types of signals such as audio samples, are far from stationary. In this case, even if each atom k is associated to its own θ_k (λ_k), the optimal value of θ_k can have significant local variations at different positions or times. This effect is shown in Figure 1(e), where, for each k, θ_k was re-estimated several times using samples from different regions of an image, and the histogram of the different estimated values of θ̂_k was computed. Here again we used the DCT basis as the dictionary D. The need for a flexible model which at the same time has a small number of parameters leads naturally to Bayesian formulations where the different possible λ_k are marginalized out by imposing a hyper-prior distribution on λ, sampling λ using its posterior distribution, and then averaging the estimates obtained with the sampled sparse-coding problems. Examples of this recent line of work, and the closely related Bayesian Compressive Sensing, are developed for example in [23], [44], [49], [48].
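The per-atom estimates θ̂_k discussed above come from the Laplacian maximum likelihood estimator, θ̂ = 1/mean(|a|). A small synthetic sketch (the θ values are chosen only to mimic the 5 to 25 range of Figure 1(d)):

```python
import numpy as np

def laplacian_mle(row):
    """ML estimate of the Laplacian inverse-scale parameter: theta_hat = 1/mean(|a|)."""
    return 1.0 / np.mean(np.abs(row))

rng = np.random.default_rng(0)
thetas = np.array([5.0, 10.0, 25.0])               # distinct per-row parameters
A_rows = np.stack([rng.laplace(scale=1.0 / t, size=20000) for t in thetas])
theta_hat = np.array([laplacian_mle(r) for r in A_rows])
```

With enough samples the estimator recovers each row's parameter closely; on short local blocks (as in Figure 1(e)) the same estimator fluctuates widely, which is exactly the non-stationarity issue raised in the text.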
Despite its promising results, the Bayesian approach is often criticized due to the potentially expensive sampling process (something which can be reduced for certain choices of the priors involved [23]), arbitrariness in the choice of the priors, and lack of proper theoretical justification for the proposed models [48]. In this work we pursue the same goal of deriving a more flexible and accurate sparse model than the traditional ones, while avoiding an increase in the number of parameters and the burden of possibly solving several sampled instances of the sparse coding problem. For this, we deploy tools from the very successful information-theoretic field of universal coding, which is an extension of the compression scenario summarized above in Section II-A to the case when the probability model for the data to be described is itself unknown and has to be described as well.

III. UNIVERSAL MODELS FOR SPARSE CODING

Following the discussion in the preceding section, we now have several possible scenarios to deal with. First, we may still want to consider a single value of θ to work well for all the coefficients in A, and try to design a sparse coding scheme that does not depend on prior knowledge of the value of θ. Secondly, we can consider an independent (but not identically distributed) Laplacian model where the underlying parameter θ can be different for each atom d_k, k = 1,...,K. In the most extreme scenario, we can consider each single coefficient a_kj in A to have its own unknown underlying θ_kj and yet, we would like to encode each of these coefficients (almost) as if we knew its hidden parameter. The first two scenarios are the ones which fit the original purpose of universal coding theory [29], which is the design of optimal codes for data whose probability models are unknown, and where the models themselves are to be encoded as well in the compressed representation. We now develop the basic ideas and techniques of universal coding applied to the first scenario, where the problem is to describe A as an IID Laplacian with unknown parameter θ.
Assuming a known parametric form for the prior, with unknown parameter θ, leads to the concept of a model class. In our case, we consider the class M = {P(A|θ) : θ ∈ Θ} of all IID Laplacian models over A ∈ R^(K×N), where

  P(A|θ) = Π_{j=1}^N Π_{k=1}^K P(a_kj|θ),  P(a_kj|θ) = θe^{−θ|a_kj|},

and Θ ⊆ R⁺. The goal of universal coding is to find a probability model Q(A) which can fit A as well as the model in M that best fits A after having observed it. A model Q(A) with this property is called universal (with respect to the model class M).

² Note that this is the case when the weights are found by maximum likelihood. Other applications of weighted ℓ1 regularizers, using other types of weighting strategies, are known to improve over ℓ1-based ones for certain applications (see e.g. [51]).

For simplicity, in the following discussion we consider the coefficient matrix A to be arranged as a single long column vector of length n = K × N, a = (a_1,...,a_n). We also use the letter a without sub-index to denote the value of a random variable representing coefficient values. First we need to define a criterion for comparing the fitting quality of different models. In universal coding theory this is done in terms of the codelengths L(a) required by each model to describe a. If the model consists of a single probability distribution P(·), we know from Section II-A3 that the optimum codelength corresponds to L_P(a) = −log P(a). Moreover, this relationship defines a one-to-one correspondence between distributions and codelengths, so that for any coding scheme L_Q(a), Q(a) = 2^{−L_Q(a)}. Now suppose that we are restricted to a class of models M, and that we need to choose the model P̂ ∈ M that assigns the shortest codelength to a particular instance of a. We then have that P̂ is the model in M that assigns the maximum probability to a. For a class M parametrized by θ, this corresponds to P̂ = P(a|θ̂(a)), where θ̂(a) is the maximum likelihood estimator (MLE) of the model class parameter θ given a (we will usually omit the argument and just write θ̂). Unfortunately, we also need to include the value of θ̂ in the description of a for the decoder to be able to reconstruct it from the code C(a). Thus, we have that any model Q(a) inducing valid codelengths L_Q(a) will have L_Q(a) > −log P(a|θ̂). The overhead of L_Q(a) with respect to −log P(a|θ̂) is known as the codelength regret,

  R(a, Q) := L_Q(a) − (−log P(a|θ̂(a))) = −log Q(a) + log P(a|θ̂(a)).

A model Q(a) (or, more precisely, a sequence of models, one for each data length n) is called universal if R(a, Q) grows sublinearly in n for all possible realizations of a, that is, (1/n)R(a, Q) → 0, ∀a ∈ Rⁿ, so that the codelength regret with respect to the MLE becomes asymptotically negligible. There are a number of ways to construct universal probability models. The simplest one is the so called two-part code, where the data is described in two parts.
The first part describes the optimal parameter θ̂(a) and the second part describes the data according to the model with the value of the estimated parameter, P(a|θ̂(a)). For uncountable parameter spaces Θ, such as a compact subset of R, the value of θ̂ has to be quantized in order to be described with a finite number of bits d. We call the quantized parameter θ̂_d. The regret for this model is thus

  R(a, Q) = L(θ̂_d) + L(a|θ̂_d) − L(a|θ̂) = L(θ̂_d) − log P(a|θ̂_d) − (−log P(a|θ̂)).

The key for this model to be universal is in the choice of the quantization step for the parameter θ̂, so that both its description L(θ̂_d) and the difference −log P(a|θ̂_d) − (−log P(a|θ̂)) grow sublinearly. This can be achieved by letting the quantization step shrink as O(1/√n) [37], thus requiring d = O(0.5 log n) bits to describe each dimension of θ̂_d. This gives a total regret for two-part codes which grows as (dim(Θ)/2) log n, where dim(Θ) is the dimension of the parameter space Θ. Another important universal code is the so called Normalized Maximum Likelihood (NML) [42]. In this case the universal model Q*(a) corresponds to the model that minimizes the worst case regret,

  Q*(a) = arg min_Q max_a {−log Q(a) + log P(a|θ̂(a))},

which can be written in closed form as Q*(a) = P(a|θ̂(a))/C(M, n), where the normalization constant C(M, n) := ∫_{Rⁿ} P(a|θ̂(a)) da determines the value of the minimax regret, log C(M, n), and depends only on M and the length of the data n.³ Note that the NML model requires C(M, n) to be finite, something which is often not the case. The two previous examples are good for assigning a probability to coefficients that have already been computed, but they cannot be used as a model for computing the coefficients themselves since they depend on having observed them in the first place. For this and other reasons that will become clearer later, we concentrate our work on a third important family of universal codes derived from the so called mixture models (also called Bayesian mixtures).
In

³ The minimax optimality of Q*(a) derives from the fact that it defines a complete uniquely decodable code for all data a of length n, that is, it satisfies the Kraft inequality with equality, Σ_{a ∈ Rⁿ} 2^{−L_{Q*}(a)} = 1. Since every uniquely decodable code with lengths {L_Q(a) : a ∈ Rⁿ} must satisfy the Kraft inequality (see [1, Chapter 5]), if there exists a value of a such that L_Q(a) < L_{Q*}(a) (that is, 2^{−L_Q(a)} > 2^{−L_{Q*}(a)}), then there must exist a vector a′ for which L_Q(a′) > L_{Q*}(a′) for the Kraft inequality to hold. Therefore the regret of Q for a′ is necessarily greater than log C(M, n), which shows that Q* is minimax optimal.

a mixture model, Q(a) is a convex mixture of all the models P(a|θ) in M, indexed by the model parameter θ, Q(a) = ∫_Θ P(a|θ)w(θ)dθ, where w(θ) specifies the weight of each model. Being a convex mixture implies that w(θ) ≥ 0 and ∫_Θ w(θ)dθ = 1, thus w(θ) is itself a probability measure over Θ. We will restrict ourselves to the particular case when a is considered a sequence of independent random variables,⁴

  Q(a) = Π_{j=1}^n Q_j(a_j),  Q_j(a_j) = ∫_Θ P(a_j|θ)w_j(θ)dθ,   (8)

where the mixing function w_j(θ) can be different for each sample j. An important particular case of this scheme is the so called Sequential Bayes code, in which w_j(θ) is computed sequentially as a posterior distribution based on previously observed samples, that is, w_j(θ) = P(θ|a_1, a_2,...,a_{j−1}) [21, Chapter 6]. In this work, for simplicity, we restrict ourselves to the case where w_j(θ) = w(θ) is the same for all j. The result is an IID model where the probability of each sample a_j is a mixture of some probability measure over R,

  Q_j(a_j) = Q(a_j) = ∫_Θ P(a_j|θ)w(θ)dθ, j = 1,...,N.   (9)

A well known result for IID mixture (Bayesian) codes states that their asymptotic regret is O((dim(Θ)/2) log n), thus establishing their universality, as long as the weighting function w(θ) is positive, continuous and unimodal over Θ (see for example [21, Theorem 8.1], [41]). This gives us great flexibility in the choice of a weighting function w(θ) that guarantees universality. Of course, the results are asymptotic and the o(log n) terms can be large, so that the choice of w(θ) can have practical impact for small sample sizes. In the following discussion we derive several IID mixture models for the Laplacian model class M. For this purpose, it will be convenient to consider the corresponding one-sided counterpart of the Laplacian, which is the exponential distribution over the absolute value of the coefficients, |a|, and then symmetrize back to obtain the final distribution over the signed coefficients.

A. The conjugate prior

In general, (9) can be computed in closed form if w(θ) is the conjugate prior of P(a|θ).
When P(a|θ) is an exponential (one-sided Laplacian), the conjugate prior is the Gamma distribution,

  w(θ|κ, β) = Γ(κ)⁻¹ θ^{κ−1} β^κ e^{−βθ}, θ ∈ R⁺,

where κ and β are its shape and scale parameters respectively. Plugging this into (9) we obtain the Mixture of exponentials model (MOE), which has the following form (see Appendix A for the full derivation),

  Q_MOE(a|β, κ) = κβ^κ (a + β)^{−(κ+1)}, a ∈ R⁺.   (10)

With some abuse of notation, we will also denote the symmetric distribution on a as MOE,

  Q_MOE(a|β, κ) = (1/2) κβ^κ (|a| + β)^{−(κ+1)}, a ∈ R.   (11)

Although the resulting prior has two parameters to deal with instead of one, we know from universal coding theory that, in principle, any choice of κ and β will give us a model whose codelength regret is asymptotically small. Furthermore, being IID models, each coefficient of a itself is modeled as a mixture of exponentials, which makes the resulting model over a very well suited to the most flexible scenario where the underlying θ can be different for each a_j. In Section V-B we will show that a single MOE distribution can fit each of the K rows of A better than K separate Laplacian distributions fine-tuned to these rows, with a total of K parameters to be estimated. Thus, not only can we deal with one single unknown θ, but we can actually achieve maximum flexibility with only two parameters (κ and β). This property is particular to the mixture models, and does not apply to the other universal models presented.

⁴ More sophisticated models which include dependencies between the elements of a are out of the scope of this work.
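The closed form (10) can be sanity-checked by integrating the exponential-Gamma mixture of (9) numerically (a sketch; κ = 2.5 and β = 0.5 are the illustrative values used for Figure 2, and the integration grid is an arbitrary choice):

```python
import math
import numpy as np

def moe_density(a, kappa, beta):
    """Closed-form one-sided MOE density, Eq. (10): kappa*beta^kappa*(a+beta)^-(kappa+1)."""
    return kappa * beta ** kappa * (a + beta) ** (-(kappa + 1))

def gamma_mixture_numeric(a, kappa, beta, n=400000, tmax=200.0):
    """Riemann-sum approximation of the mixture integral (9):
    integral of theta*exp(-theta*a) * Gamma(theta | kappa, beta) over theta."""
    theta = np.linspace(1e-9, tmax, n)
    w = theta ** (kappa - 1) * beta ** kappa * np.exp(-beta * theta) / math.gamma(kappa)
    integrand = theta * np.exp(-theta * a) * w
    return float(np.sum(integrand) * (theta[1] - theta[0]))

kappa, beta = 2.5, 0.5
vals = [(gamma_mixture_numeric(a, kappa, beta), moe_density(a, kappa, beta))
        for a in (0.1, 1.0, 5.0)]
```

The numerical mixture and the closed form agree at every tested point, confirming the derivation behind (10).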

Finally, if desired, both κ and β can be easily estimated using the method of moments (see Appendix A). Given sample estimates of the first and second non-central moments, μ̂₁ = (1/n) Σ_{j=1}^n |a_j| and μ̂₂ = (1/n) Σ_{j=1}^n a_j², we have that

  κ̂ = 2(μ̂₂ − μ̂₁²)/(μ̂₂ − 2μ̂₁²) and β̂ = (κ̂ − 1)μ̂₁.   (12)

When the MOE prior is plugged into (5) instead of the standard Laplacian, the following new sparse coding formulation is obtained,

  a_j = arg min_a ‖x_j − Da‖²₂ + λ_MOE Σ_{k=1}^K log(|a_k| + β),   (13)

where λ_MOE = 2σ²(κ + 1). An example of the MOE regularizer, and the thresholding function it induces, is shown in Figure 2 (center column) for κ = 2.5, β = 0.5. Smooth, differentiable non-convex regularizers such as the one in (13) have become a mainstream robust alternative to the ℓ1 norm in statistics [16], [51]. Furthermore, it has been shown that the use of such regularizers in regression leads to consistent estimators which are able to identify the relevant variables in a regression model (oracle property) [16]. This is not always the case for the ℓ1 regularizer, as was proved in [51]. The MOE regularizer has also been recently proposed in the context of compressive sensing [6], where it is conjectured to be better than the ℓ1-term at recovering sparse signals in compressive sensing applications.⁵ This conjecture was partially confirmed recently for non-convex regularizers of the form ψ(a) = ‖a‖_r^r with 0 < r < 1 in [39], [18], and for a more general family of non-convex regularizers including the one in (13) in [47]. In all cases, it was shown that the conditions on the sensing matrix (here D) can be significantly relaxed to guarantee exact recovery if non-convex regularizers are used instead of the ℓ1 norm, provided that the exact solution to the non-convex optimization problem can be computed. In practice, this regularizer is being used with success in a number of applications here and in [7], [46].⁶
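Equation (12) can be sketched and verified on data drawn from the generative MOE model itself, sampling θ from the Gamma prior and then a from the corresponding exponential (the values of κ, β and the sample size here are illustrative):

```python
import numpy as np

def moe_moment_estimates(a):
    """Method-of-moments estimators of Eq. (12):
    kappa_hat = 2(mu2 - mu1^2)/(mu2 - 2*mu1^2),  beta_hat = (kappa_hat - 1)*mu1,
    with mu1 = mean(|a|) and mu2 = mean(a^2)."""
    m = np.abs(np.asarray(a, dtype=float))
    mu1 = m.mean()
    mu2 = (m ** 2).mean()
    kappa = 2.0 * (mu2 - mu1 ** 2) / (mu2 - 2.0 * mu1 ** 2)
    beta = (kappa - 1.0) * mu1
    return kappa, beta

rng = np.random.default_rng(0)
kappa_true, beta_true = 6.0, 1.0
# Generative MOE sampling: theta ~ Gamma(shape=kappa, rate=beta), then a ~ Exp(theta)
theta = rng.gamma(shape=kappa_true, scale=1.0 / beta_true, size=500000)
a = rng.exponential(scale=1.0 / theta)
kappa_hat, beta_hat = moe_moment_estimates(a)
```

On this synthetic draw the estimators recover (κ, β) closely, in closed form and without any numerical optimization, which is the practical advantage claimed over generalized Gaussian fitting.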
Our experimental results in Section V provide further evidence of the benefits of the use of non-convex regularizers, leading to much improved recovery accuracy of sparse coefficients compared to ℓ1 and ℓ0. We also show in Section V that the MOE prior is much more accurate than the standard Laplacian in modeling the distribution of reconstruction coefficients drawn from a large database of image patches, and how these improvements lead to better results in applications such as image estimation and classification.

B. The Jeffreys prior

The Jeffreys prior for a parametric model class M = {P(a|θ), θ ∈ Θ} is defined as

  w(θ) = √|I(θ)| / ∫_Θ √|I(ξ)| dξ, θ ∈ Θ,   (14)

where |I(θ)| is the determinant of the Fisher information matrix

  I(θ) = E_{P(a|θ)}[ −∂² log P(a|θ)/∂θ² ].   (15)

The Jeffreys prior is well known in Bayesian theory due to three important properties: it virtually eliminates the hyper-parameters of the model, it is invariant to the original parametrization of the distribution, and it is a non-informative prior, meaning that it represents well the lack of prior information on the unknown parameter θ [3]. It turns out that, for quite different reasons, the Jeffreys prior is also of paramount importance in the theory of universal coding. For instance, it has been shown in [2] that the worst case regret of the mixture code obtained using the Jeffreys prior approaches that of the NML as the number of samples n grows. Thus, by using Jeffreys, one can attain the minimum worst case regret asymptotically, while retaining the advantages of a mixture (not needing hindsight of a), which in our case means being able to use it as a model for computing a via sparse coding. For the exponential distribution we have that I(θ) = 1/θ². Clearly, if we let Θ = (0, ∞), the integral in (14) evaluates to ∞. Therefore, in order to obtain a proper integral, we need to exclude 0 and ∞ from Θ (note that

⁵ In [6], the logarithmic regularizer arises from approximating the ℓ0 pseudo-norm as an ℓ1-normalized element-wise sum, without the insight and theoretical foundation here reported.
⁶ While these works support the use of such non-convex regularizers, none of them formally derives them using the universal coding framework as in this paper.

Fig. 2. Left to right: l1 (green), MOE (red) and JOE (blue) regularizers, and their corresponding thresholding functions thres(x) := arg min_a {(x − a)² + λψ(|a|)}. The unbiasedness of MOE is due to the fact that large coefficients are not shrunk by the thresholding function. Also, although the JOE regularizer is biased, the shrinkage of large coefficients can be much smaller than the one applied to small coefficients.

this was not needed for the conjugate prior). We choose to define Θ = [θ1, θ2], 0 < θ1 < θ2 < ∞, leading to w(θ) = (1/ln(θ2/θ1))(1/θ), θ ∈ [θ1, θ2]. The resulting mixture, after being symmetrized around 0, has the following form (see Appendix A):

Q_JOE(a|θ1, θ2) = (1/(2|a| ln(θ2/θ1))) (e^{−θ1|a|} − e^{−θ2|a|}), a ≠ 0. (16)

We refer to this prior as a Jeffreys mixture of exponentials (JOE), and again overload this acronym to refer to the symmetric case as well. Note that although Q_JOE is not defined at a = 0, its limit when a → 0 is finite and evaluates to (θ2 − θ1)/(2 ln(θ2/θ1)). Thus, by defining Q_JOE(0) := (θ2 − θ1)/(2 ln(θ2/θ1)), we obtain a prior that is well defined and continuous for all a ∈ R. When plugged into (5), we get the JOE-based sparse coding formulation,

min_a ‖x_j − Da‖² + λ_JOE Σ_{k=1}^K {log |a_k| − log(e^{−θ1|a_k|} − e^{−θ2|a_k|})}, (17)

where, according to the convention just defined for Q_JOE(0), we define ψ_JOE(0) := −log((θ2 − θ1)/(2 ln(θ2/θ1))). According to the MAP interpretation we have that λ_JOE = 2σ², coming from the Gaussian assumption on the approximation error as explained in Section II-A. As with MOE, the JOE-based regularizer, ψ_JOE(|a|) = −log Q_JOE(|a|), is continuous and differentiable in R+, and its derivative converges to a finite value at zero, lim_{a→0} ψ'_JOE(a) = (θ2² − θ1²)/(2(θ2 − θ1)). As we will see later in Section IV, these properties are important to guarantee the convergence of sparse coding algorithms using non-convex priors. Note from (17) that we can rewrite the JOE regularizer as ψ_JOE(|a_k|) = log |a_k| − log(e^{−θ1|a_k|}(1 − e^{−(θ2−θ1)|a_k|})) = θ1|a_k| + log |a_k| − log(1 − e^{−(θ2−θ1)|a_k|}), so that for sufficiently large |a_k|, log(1 − e^{−(θ2−θ1)|a_k|}) ≈ 0 and θ1|a_k| ≫ log |a_k|, and we have that ψ_JOE(|a_k|) ≈ θ1|a_k|.
Thus, for large |a_k|, the JOE regularizer behaves like l1 with λ = 2σ²θ1. In terms of the probability model, this means that the tails of the JOE mixture behave like a Laplacian with θ = θ1, with the region where this happens determined by the value of θ2 − θ1. The fact that the non-convex region of ψ_JOE(|a|) is confined to a neighborhood around 0 could help to avoid falling into bad local minima during the optimization (see Section IV for more details on the optimization aspects). Finally, although having Laplacian tails means that the estimated a will be biased [16], the sharper peak at 0 allows us to perform more aggressive thresholding of small values without excessively clipping large coefficients, which is what leads to the typical over-smoothing of signals recovered using an l1 regularizer. See Figure 2 (rightmost column) for an example regularizer based on JOE with parameters θ1 = 2, θ2 = 10, and the thresholding function it induces. The JOE regularizer has two hyper-parameters (θ1, θ2), which define Θ and, in principle, need to be tuned. One possibility is to choose θ1 and θ2 based on the physical properties of the data to be modeled, so that the possible values of θ never fall outside of the range [θ1, θ2]. For example, when modeling patches from grayscale images with a limited dynamic range of [0, 255] in a DCT basis, the maximum variance of the coefficients can never exceed a bound determined by that range. The same is true for the minimum variance, which is defined by the quantization noise. Having said this, in practice it is advantageous to adjust [θ1, θ2] to the data at hand. In this case, although no closed form solutions exist for estimating [θ1, θ2] using MLE or the method of moments, standard optimization techniques can be easily applied to obtain them. See Appendix A for details.
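To make the regularizers and thresholding functions of Figure 2 concrete, here is a minimal numerical sketch (ours, not from the paper) that evaluates ψ_MOE and ψ_JOE up to additive constants and computes thres(x) = arg min_a {(x − a)² + λψ(|a|)} by brute-force grid search; note how MOE barely shrinks large inputs while setting small ones exactly to zero:

```python
import numpy as np

def psi_moe(a, beta):
    # MOE regularizer, up to an additive constant: log(|a| + beta)
    return np.log(np.abs(a) + beta)

def psi_joe(a, t1, t2):
    # JOE regularizer, up to an additive constant:
    # log|a| - log(exp(-t1|a|) - exp(-t2|a|)); continuous limit at 0.
    a = np.maximum(np.abs(a), 1e-12)  # the limit at 0 equals -log(t2 - t1)
    return np.log(a) - np.log(np.exp(-t1 * a) - np.exp(-t2 * a))

def thres(x, lam, psi):
    # Brute-force thresholding function: argmin_a (x - a)^2 + lam * psi(a)
    a = np.linspace(-abs(x) - 1.0, abs(x) + 1.0, 200001)
    return a[np.argmin((x - a) ** 2 + lam * psi(a))]

print(thres(5.0, 1.0, lambda a: psi_moe(a, 0.5)))  # large input, barely shrunk
print(thres(0.3, 1.0, lambda a: psi_moe(a, 0.5)))  # small input, thresholded to 0
```

The grid search stands in for the closed-form thresholding rules plotted in Figure 2; it is only meant to reproduce their qualitative behavior.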

C. The conditional Jeffreys

A recent approach to deal with the case when the integral over Θ in the Jeffreys prior is improper is the conditional Jeffreys [21, Chapter 11]. The idea is to construct a proper prior based on the improper Jeffreys prior and the first few n0 samples of a, (a1, a2, ..., a_{n0}), and then use it for the remaining data. The key observation is that although the normalizing integral ∫_Θ √I(θ) dθ in the Jeffreys prior is improper, the unnormalized prior w(θ) = √I(θ) can be used as a measure to weight P(a1, a2, ..., a_{n0}|θ),

w(θ) = P(a1, a2, ..., a_{n0}|θ) √I(θ) / ∫_Θ P(a1, a2, ..., a_{n0}|ξ) √I(ξ) dξ. (18)

It turns out that the integral in (18) usually becomes proper for a small n0, on the order of dim(Θ). In our case we have that for any n0 ≥ 1, the resulting prior is a Gamma(κ0, β0) distribution with κ0 := n0 and β0 := Σ_{j=1}^{n0} |a_j| (see Appendix A for details). Therefore, using the conditional Jeffreys prior in the mixture leads to a particular instance of MOE, which we denote by CMOE (although the functional form is identical to MOE), where the Gamma parameters κ and β are automatically selected from the data. This may explain in part why the Gamma prior performs so well in practice, as we will see in Section V. Furthermore, we observe that the value of β obtained with this approach (β0) coincides with the one estimated using the method of moments for MOE if the κ in MOE is fixed to κ = κ0 + 1 = n0 + 1. Indeed, if computed from n0 samples, the method of moments for MOE gives β = (κ − 1)μ̂1, with μ̂1 = (1/n0)Σ|a_j|, which gives us β = n0 · (1/n0)Σ|a_j| = β0. It turns out in practice that the value of κ estimated using the method of moments lies between 2 and 3 for the type of data that we deal with (see Section V), which is just above the minimum acceptable value for the CMOE prior to be defined, n0 = 1. This justifies our choice of n0 = 2 when applying CMOE in practice. As n0 becomes large, so does κ0 = n0, and the Gamma prior w(θ) obtained with this method converges to a Kronecker delta at the mean value of the Gamma distribution, δ_{κ0/β0}(θ).
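In code, the CMOE hyper-parameters are trivial to obtain; the following sketch (ours, with hypothetical coefficient values) also checks the coincidence with the method-of-moments β when κ is fixed to n0 + 1:

```python
import numpy as np

def cmoe_params(a_first):
    """Conditional-Jeffreys (CMOE) Gamma hyper-parameters from the first
    n0 observed coefficients: kappa0 = n0, beta0 = sum of |a_j|."""
    a_first = np.asarray(a_first, dtype=float)
    return len(a_first), float(np.sum(np.abs(a_first)))

a0 = [0.12, -0.05]            # hypothetical first n0 = 2 coefficients
kappa0, beta0 = cmoe_params(a0)

# Method of moments with kappa fixed to n0 + 1 gives the same beta:
# beta = (kappa - 1) * mean(|a_j|) = n0 * mean(|a_j|) = sum(|a_j|) = beta0
beta_mom = (kappa0 + 1 - 1) * np.mean(np.abs(a0))
print(kappa0, beta0)  # kappa0 = 2, beta0 = 0.17
```

This mirrors the identity in the text: fixing κ = n0 + 1 makes the moments-based β equal to the conditional-Jeffreys β0.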
Consequently, when w(θ) ≈ δ_{κ0/β0}(θ), the mixture ∫_Θ P(a|θ)w(θ)dθ will be close to P(a|κ0/β0). Moreover, from the definitions of κ0 and β0 we have that κ0/β0 is exactly the MLE of θ for the Laplacian distribution. Thus, for large n0, the conditional Jeffreys method approaches the MLE Laplacian model. Although from a universal coding point of view this is not a problem, for large n0 the conditional Jeffreys model loses its flexibility to deal with the case when different coefficients in A have different underlying θ. On the other hand, a small n0 can lead to a prior w(θ) that is overfitted to the local properties of the first samples, which for non-stationary data such as image patches can be problematic. Ultimately, n0 defines a trade-off between the degree of flexibility and the accuracy of the resulting model.

IV. OPTIMIZATION AND IMPLEMENTATION DETAILS

All of the mixture models discussed so far yield non-convex regularizers, rendering the sparse coding problem non-convex in a. It turns out, however, that these regularizers satisfy certain conditions which make the resulting sparse coding optimization well suited to be approximated using a sequence of successive convex sparse coding problems, a technique known as Local Linear Approximation (LLA) [52] (see also [46], [19] for alternative optimization techniques for such non-convex sparse coding problems). In a nutshell, suppose we need to obtain an approximate solution to

a_j = arg min_a ‖x_j − Da‖² + λ Σ_{k=1}^K ψ(|a_k|), (19)

where ψ(|a|) is a non-convex function over R+. At each LLA iteration, we compute a_j^{(t+1)} by doing a first order expansion of ψ(|a|) around the K elements of the current estimate a_{kj}^{(t)},

ψ̃_k^{(t)}(|a|) = ψ(|a_{kj}^{(t)}|) + ψ'(|a_{kj}^{(t)}|)(|a| − |a_{kj}^{(t)}|) = ψ'(|a_{kj}^{(t)}|)|a| + c_k,

and solving the convex weighted l1 problem that results after discarding the constant terms c_k,

a_j^{(t+1)} = arg min_a ‖x_j − Da‖² + λ Σ_{k=1}^K ψ̃_k^{(t)}(|a_k|) = arg min_a ‖x_j − Da‖² + λ Σ_{k=1}^K ψ'(|a_{kj}^{(t)}|)|a_k| = arg min_a ‖x_j − Da‖² + Σ_{k=1}^K λ_k^{(t)}|a_k|, (20)

where we have defined λ_k^{(t)} := λψ'(|a_{kj}^{(t)}|). If ψ'(|a|) is continuous in (0, +∞), and right-continuous and finite at 0, then the LLA algorithm converges to a stationary point of (19) [51]. These conditions are met for both the MOE and JOE regularizers. Although for the JOE prior the derivative ψ'(|a|) is not defined at 0, it converges to the limit (θ2² − θ1²)/(2(θ2 − θ1)) when a → 0, which is well defined for θ2 ≠ θ1. If θ2 = θ1, the JOE mixing function is a Kronecker delta and the prior becomes a Laplacian with parameter θ = θ1 = θ2. Therefore we have that, for all of the mixture models studied, the LLA method converges to a stationary point. In practice, we have observed that 5 iterations are enough to converge. Thus, the cost of sparse coding with the proposed non-convex regularizers is at most 5 times that of a single l1 sparse coding, and could be less in practice if warm restarts are used to begin each iteration. Of course we need a starting point a_j^{(0)}, and, this being a non-convex problem, this choice will influence the approximation that we obtain. One reasonable choice, used in this work, is to define a_{kj}^{(0)} = a0, k = 1,...,K, j = 1,...,N, where a0 is a scalar such that ψ'(a0) = E_w[θ], that is, so that the first sparse coding iteration corresponds to a Laplacian regularizer whose parameter is the average value of θ under the mixing prior w(θ). Finally, note that although the discussion here has revolved around the Lagrangian formulation of sparse coding (4), this technique is also applicable to the constrained formulation of sparse coding given by Equation (1) for a fixed dictionary D.

Expected approximation error: Since we are solving a convex approximation to the actual target optimization problem, it is of interest to know how good this approximation is in terms of the original cost function.
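As an illustration of the LLA iteration (20), the following sketch (ours, not the paper's implementation) runs LLA with the MOE regularizer in the special case D = I, where each weighted l1 subproblem reduces to coordinate-wise soft-thresholding:

```python
import numpy as np

def soft(x, w):
    # Solution of argmin_a (x - a)^2 + w*|a| (soft-thresholding by w/2)
    return np.sign(x) * np.maximum(np.abs(x) - w / 2.0, 0.0)

def lla_moe(x, lam, beta, n_iter=5):
    """LLA for min_a ||x - a||^2 + lam * sum_k log(|a_k| + beta),
    assuming D = I for illustration. Each iteration solves the weighted
    l1 problem with weights lam * psi'(|a_k|) = lam / (|a_k| + beta)."""
    a = x.copy()                        # simple warm start at a = x
    for _ in range(n_iter):
        w = lam / (np.abs(a) + beta)    # first-order expansion of psi
        a = soft(x, w)                  # convex weighted l1 subproblem
    return a

x = np.array([5.0, 0.3, -4.0, 0.1])
a = lla_moe(x, lam=1.0, beta=0.5)
print(a)  # large entries barely shrunk, small ones exactly zero
```

With a general dictionary D, the soft-thresholding step would simply be replaced by any weighted l1 sparse coding solver, as in the text.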
To give an idea of this, after an approximate solution is obtained, we compute the expected value of the difference between the true and approximate regularization term values. The expectation is taken, naturally, with respect to the assumed distribution of the coefficients in a. Since the regularizers are separable, we can compute the error in a separable way as an expectation over each k-th coefficient, ζ_q(a_k) = E_{ν∼q}[ψ̃_k(|ν|) − ψ(|ν|)], where ψ̃_k(·) is the approximation of ψ(·) around the final estimate of a_k. For the case q = MOE, the expression obtained is (see Appendix)

ζ_MOE(a_k, κ, β) = E_{ν∼MOE(κ,β)}[ψ̃_k(|ν|) − ψ(|ν|)] = log(a_k + β) + (1/(a_k + β))[β/(κ − 1) − a_k] − log β − 1/κ.

In the MOE case, for κ and β fixed, the minimum of ζ_MOE occurs at a_k = β/(κ − 1) = μ(β, κ). We also have ζ_MOE(0) = (κ − 1)^{−1} − κ^{−1}. The function ζ_q(·) can be evaluated at each coefficient of A to give an idea of its quality. For example, in the experiments of Section V, we obtained an average value of 0.16, which lies between ζ_MOE(0) = 0.19 and min_a ζ_MOE(a) = 0.09. Depending on the experiment, this represents 6% to 7% of the total sparse coding cost function value, showing the efficiency of the proposed optimization.

Comments on parameter estimation: All the universal models presented so far, with the exception of the conditional Jeffreys, depend on hyper-parameters which in principle should be tuned for optimal performance (remember that they do not influence the universality of the model). If tuning is needed, it is important to remember that the proposed universal models are intended for the reconstruction coefficients of clean data, and thus their hyper-parameters should be computed from statistics of clean data, or by compensating for the distortion in the statistics caused by noise (see for example [3]). Finally, note that when D is linearly dependent and rank(D) = R ≤ M, the coefficients matrix A resulting from an exact reconstruction of X will have many zeros, which are not properly explained by any continuous distribution such as a Laplacian.
We sidestep this issue by computing the statistics only from the non-zero coefficients in A. Dealing properly with the case P(a = 0) > 0 is beyond the scope of this work.
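The expected approximation error ζ_MOE given above is easy to check numerically; this sketch (ours) verifies that its minimum is attained at a_k = β/(κ − 1) and that ζ_MOE(0) = (κ − 1)^{−1} − κ^{−1}:

```python
import numpy as np

def zeta_moe(a0, kappa, beta):
    # Expected gap between the LLA linearization around a0 and the true
    # MOE regularizer, E[psi~(a) - psi(a)] under a ~ MOE(kappa, beta)
    return (np.log(a0 + beta) + (beta / (kappa - 1.0) - a0) / (a0 + beta)
            - np.log(beta) - 1.0 / kappa)

kappa, beta = 2.8, 0.07          # values of the order fitted in Section V
grid = np.linspace(0.0, 1.0, 100001)
amin = grid[np.argmin(zeta_moe(grid, kappa, beta))]
print(zeta_moe(0.0, kappa, beta))  # equals 1/(kappa-1) - 1/kappa
print(amin)                        # close to beta/(kappa-1)
```

Evaluating ζ_MOE on the final coefficients, as done in the text, gives a cheap per-coefficient certificate of how tight the convex surrogate was.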

V. EXPERIMENTAL RESULTS

In the following experiments, the testing data X are 8×8 patches drawn from the Pascal VOC2006 testing subset,⁷ which consists of high quality RGB images with 8 bits per channel. For the experiments, we converted the images to grayscale by averaging the channels, and scaled the dynamic range to lie in the [0, 1] interval. Similar results to those shown here are also obtained for other patch sizes.

A. Dictionary learning

For the experiments that follow, unless otherwise stated, we use a global overcomplete dictionary D with K = 4M = 256 atoms, trained on the full VOC2006 training subset using the method described in [35], [36], which seeks to minimize the following cost during training,⁸

min_{D,A} (1/N) Σ_{j=1}^N {‖x_j − Da_j‖² + λψ(a_j)} + μ‖DᵀD‖²_F, (21)

where ‖·‖_F denotes the Frobenius norm. The additional term, μ‖DᵀD‖²_F, encourages incoherence in the learned dictionary; that is, it forces the atoms to be as orthogonal as possible. Dictionaries with lower coherence are well known to have several theoretical advantages, such as an improved ability to recover sparse signals [11], [45], and faster and better convergence to the solution of the sparse coding problems (1) and (3) [13]. Furthermore, in [35] it was shown that adding incoherence leads to improvements in a variety of sparse modeling applications, including the ones discussed below. We used MOE as the regularizer in (21), with λ = 0.1 and μ = 1, both chosen empirically. See [1], [26], [35] for details on the optimization of (3) and (21).

B. MOE as a prior for sparse coding coefficients

We begin by comparing the performance of the Laplacian and MOE priors for fitting a single global distribution to the whole matrix A. We compute A using (1) with ε ≈ 0 and then, following the discussion in Section IV, restrict our study to the nonzero elements of A. The empirical distribution of A is plotted in Figure 3(a), along with the best fitting Laplacian, MOE, JOE, and a particularly good example of the conditional Jeffreys (CMOE) distributions.⁹
The MLE for the Laplacian fit is θ̂ = N1/‖A‖1 = 27.2 (here N1 is the number of nonzero elements in A). For MOE, using (12), we obtained κ = 2.8 and β = 0.07. For JOE, the estimated θ1 was 2.4, while the estimated θ2 was excessively large (see below). Following the discussion in Section III-C, we used the value κ = 2.8 obtained with the method of moments for MOE as a hint for choosing n0 = 2 (κ = n0 + 1 = 3 ≈ 2.8), yielding β0 = 0.07, which coincides with the β obtained using the method of moments. As observed in Figure 3(a), in all cases the proposed mixture models fit the data better: significantly better for both Gamma-based mixtures, MOE and CMOE, and slightly better for JOE. This is further confirmed by the Kullback-Leibler divergence (KLD) obtained in each case. Note that JOE fails to significantly improve on the Laplacian model due to the excessively large estimated range [θ1, θ2]. In this sense, it is clear that the JOE model is very sensitive to its hyper-parameters, and a better and more robust estimation would be needed for it to be useful in practice. Given these results, hereafter we concentrate on the best case, which is the MOE prior (which, as detailed above, can be derived from the conditional Jeffreys as well, thus representing both approaches). From Figure 1(e) we know that the optimal θ̂ varies locally across different regions; thus, we expect the mixture models to perform well also on a per-atom basis. This is confirmed in Figure 3(b), where we show, for each row k, k = 1,...,K, the difference in KLD between the globally fitted MOE distribution and the best per-atom fitted MOE, the globally fitted Laplacian, and the per-atom fitted Laplacians, respectively. As can be observed, the KLD obtained with the global MOE is significantly smaller than that of the global Laplacian in all cases, and even than the per-atom Laplacians in most cases.
This shows that MOE, with only two parameters (which can be easily estimated, as detailed in the text), is a much better model than K Laplacians (requiring K critical parameters) fitted specifically to the coefficients associated with each atom. Whether these modeling improvements have practical impact is explored in the next experiments.

⁸ While we could have used off-the-shelf dictionaries such as DCT in order to test our universal sparse coding framework, it is important to use dictionaries that lead to state-of-the-art results in order to show the additional potential improvement of our proposed regularizers.
⁹ To compute the empirical distribution, we quantized the elements of A uniformly in steps of 2⁻⁸, which, for the amount of data available, gives us enough detail and at the same time reliable statistics for all the quantized values.

Fig. 3. (a) Empirical distribution of the coefficients in A for image patches (blue dots), and best fitting Laplacian (green), MOE (red), CMOE (orange) and JOE (yellow) distributions. The Laplacian (KLD = 0.17 bits) is clearly not fitting the tails properly, and is not sufficiently peaked at zero either. The two models based on a Gamma prior, MOE (KLD = 0.1 bits) and CMOE (KLD = 0.1 bits), provide an almost perfect fit. The fitted JOE (KLD = 0.14 bits) is the most sharply peaked at 0, but does not fit the tails as tightly as desired. As a reference, the entropy of the empirical distribution is H = 3.0 bits. (b) KLD for the best fitting global Laplacian (dark green), per-atom Laplacian (light green), global MOE (dark red) and per-atom MOE (light red), relative to the KLD between the globally fitted MOE distribution and the empirical distribution. The horizontal axis represents the index of each atom, k = 1,...,K, ordered according to the difference in KLD between the global MOE and the per-atom Laplacian model. Note how the global MOE outperforms both the global and per-atom Laplacian models in all but the first 4 cases. (c) Active set recovery accuracy of l1 and MOE, as defined in Section V-C, for L = 5 and L = 10, as a function of σ. The improvement of MOE over l1 is a factor of 5 to 9. (d) PSNR of the recovered sparse signals with respect to the true signals. In this case significant improvements can be observed in the high SNR range, especially for highly sparse (L = 5) signals. The performance of both methods is practically the same for σ ≥ 10.

C. Recovery of noisy sparse signals

Here we compare the active set recovery properties of the MOE prior with those of the l1-based one, on data for which the sparsity assumption |A_j| ≤ L holds exactly for all j.
To this end, we obtain sparse approximations to each sample x_j using the l0-based Orthogonal Matching Pursuit algorithm (OMP) on D [28], and record the resulting active sets A_j as ground truth. The data is then contaminated with additive Gaussian noise of variance σ² and the recovery is performed by solving (1) for A with ε = CMσ² and either the l1 or the MOE-based regularizer for ψ(·). We use C = 1.32, which is a standard value in denoising applications (see for example [27]). For each sample j, we measure the error of each method in recovering the active set as the Hamming distance between the true and estimated supports of the corresponding reconstruction coefficients. The accuracy of the method is then given as the percentage of samples for which this error falls below a certain threshold T. Results are shown in Figure 3(c) for L = (5, 10) and T = (2, 4), respectively, for various values of σ. Note the very significant improvement obtained with the proposed model. Given the estimated active set A_j, the estimated clean patch is obtained by projecting x_j onto the subspace defined by the atoms that are active according to A_j, using least squares (which is the standard procedure for denoising once the active set is determined). We then measure the PSNR of the estimated patches with respect to the true ones. The results are shown in Figure 3(d), again for various values of σ. As can be observed, the MOE-based recovery is significantly better, especially in the high SNR range. Notably, the more accurate active set recovery of MOE does not seem to improve the denoising performance in this case. However, as we will see next, it does make a difference when denoising real-life signals, as well as in classification tasks.

D. Recovery of real signals with simulated noise

This experiment is analogous to the previous one, with the data being the original natural image patches (without forcing exact sparsity).
Since in this case the sparsity assumption is only approximate, and no ground truth is available for the active sets, we compare the different methods in terms of their denoising performance. A critical strategy in image denoising is the use of overlapping patches, where for each pixel in the image a patch is extracted with that pixel as its center. The patches are denoised independently as M-dimensional signals

and then recombined into the final denoised images by simple averaging. Although this consistently improves the final result in all cases, the improvement is very different depending on the method used to denoise the individual patches. Therefore, we now compare the denoising performance of each method at two levels: individual patches and final image. To denoise each image, the global dictionary described in Section V-A is further adapted to the noisy image patches using (21) for a few iterations, and used to encode the noisy patches via (2) with ε = CMσ².

Fig. 4. Sample image denoising results. Top: Barbara, σ = 30. Bottom: Boats, σ = 40. From left to right: noisy, l0/OMP, l1/l1, MOE/MOE. The reconstruction obtained with the proposed model is more accurate, as evidenced by the better reconstruction of the texture in Barbara and of the sharp edges in Boats, and does not produce the artifacts seen in both the l1 and l0 reconstructions, which appear as black/white speckles all over Barbara, and as ringing on the edges in Boats.

TABLE I. Denoising results for σ = 10, 20, 30, 40: each column shows the denoising performance of a learning+coding combination (l0, l1 and MOE) on the images Barbara, Boats, Lena, Peppers and Man. Results are shown in pairs, where the left number is the PSNR between the clean and recovered individual patches, and the right number is the PSNR between the clean and recovered images. Best results are in bold. The proposed MOE produces better final results than both the l0 and l1 ones in all cases, and at the patch level for all σ > 10. Note that the average values reported are the PSNR of the average MSE, and not the average of the PSNRs.
We repeated the experiment for two learning variants (l1 and MOE regularizers) and two coding variants ((2) with the regularizer used for learning, and l0 via OMP). The four variants were applied to the standard images Barbara, Boats, Lena, Man and Peppers, and the results are summarized in Table I. Sample results are shown in Figure 4. Although the quantitative improvements in Table I are small compared to l1, there is a significant improvement at the visual level, as can be seen in Figure 4. In all cases the PSNR obtained matches or surpasses the ones reported in [1].¹⁰

E. Zooming

As an example of signal recovery in the absence of noise, we took the previous set of images, plus a particularly challenging one (Tools), and subsampled them to half their size on each side. We then simulated a zooming effect by upsampling

¹⁰ Note that in [1], the denoised image is finally blended with the noisy image using an empirical weight, providing an extra improvement to the final PSNR in some cases. The results in Table I are already better without this extra step.
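The overlapping-patch strategy used in the denoising experiments above can be sketched as follows (our illustrative code, with a placeholder patch denoiser standing in for the sparse-coding step); the overlapping reconstructions are recombined by simple averaging:

```python
import numpy as np

def denoise_image(img, denoise_patch, p=8):
    """Denoise every overlapping p x p patch independently, then
    recombine the results by averaging the overlapping estimates
    (one patch per position where a full patch fits)."""
    h, w = img.shape
    acc = np.zeros((h, w))   # sum of patch estimates per pixel
    cnt = np.zeros((h, w))   # number of patches covering each pixel
    for i in range(h - p + 1):
        for j in range(w - p + 1):
            acc[i:i+p, j:j+p] += denoise_patch(img[i:i+p, j:j+p])
            cnt[i:i+p, j:j+p] += 1.0
    return acc / cnt

# With an identity "denoiser", averaging recovers the input exactly.
img = np.random.default_rng(0).random((16, 16))
out = denoise_image(img, lambda patch: patch)
print(np.allclose(out, img))  # -> True
```

In the actual experiments, `denoise_patch` would be the sparse-coding estimate of the clean patch under the chosen regularizer.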

Fig. 5. Zooming results. Left to right: summary table (PSNR of cubic, l0, l1 and MOE interpolation on Barbara, Boats, Lena, Peppers, Man and Tools, with averages), the Tools image, and details of the zooming results for the framed region (top to bottom, left to right: cubic, l0, l1, MOE). As can be seen, the MOE result is as sharp as l0 but produces fewer artifacts. This is reflected in the 0.1 dB overall improvement obtained with MOE, as seen in the summary table.

them and estimating each of the 75% missing pixels (see e.g. [5] and references therein). We use a technique similar to the one used in [32]. The image is first interpolated and then deconvolved using a Wiener filter. The deconvolved image has artifacts that we treat as noise in the reconstruction. However, since there is no real noise, we do not perform averaging of the patches, using only the center pixel of x̂_j to fill in the missing pixel at j. The results are summarized in Figure 5, where we again observe that using MOE instead of l0 and l1 improves the results.

F. Classification with universal sparse models

In this section we apply our proposed universal models to a classification problem where each sample x_j is to be assigned a class label y_j = 1,...,c, which serves as an index into the set of possible classes, {C1, C2,...,Cc}. We follow the procedure of [36], where the classifier assigns each sample x_j by means of the maximum a posteriori criterion (5), with the term −log P(a) corresponding to the assumed prior, and where the dictionaries representing each class are learned from training samples using (21) with the corresponding regularizer ψ(a) = −log P(a). Each experiment is repeated for the baseline Laplacian model, implied by the l1 regularizer, and for the universal model MOE, and the results are then compared. In this case we expect that a more accurate prior model for the coefficients will result in an improved likelihood estimation, which in turn should improve the accuracy of the system. We begin with a classic texture classification problem, where patches have to be identified as belonging to one out of a number of possible textures.
In this case we experimented with samples of c = 2 and c = 3 textures drawn at random from the Brodatz database,¹¹ the ones actually used being shown in Figure 6. In each case the experiment was repeated 10 times. In each repetition, a dictionary of K = 300 atoms was learned from all patches of the leftmost half of each sample texture. We then classified the patches from the rightmost halves of the texture samples. For c = 2 we obtained an average error rate of 5.13% using l1 against 4.12% when using MOE, which represents a reduction of 20% in classification error. For c = 3 the average error rates were 13.54% using l1 and 11.48% using MOE, which is 15% lower. Thus, using the universal model instead of l1 yields a significant improvement in this case (see for example [26] for other results on classification of Brodatz textures). The second sample problem presented is the Graz 02 bike detection problem,¹² where each pixel of each testing image has to be classified as either background or as part of a bike. In the Graz 02 dataset, each pixel can belong to one of two classes: bike or background. For each of the training images (which by convention are the first 150 even-numbered images), we are given a mask that tells us whether each pixel belongs to a bike or to the background. We then train a dictionary for bike patches and another for background patches. Patches that contain pixels from both classes are assigned to the class corresponding to the majority of their pixels.

¹¹ ...tranden/brodatz.html

Fig. 6. Textures used in the texture classification example.

Fig. 7. Classification results. Left to right: precision vs. recall curve, sample image from the Graz 02 dataset, its ground truth, and the corresponding estimated maps obtained with l1 and MOE for a fixed threshold. The precision vs. recall curve shows that the mixture model gives better precision in all cases. In the example, the classification obtained with MOE yields fewer false positives and more true positives than the one obtained with l1.

In Figure 7 we show the precision vs. recall curves obtained with the detection framework when either the l1 or the MOE regularizer was used in the system. As can be seen, the MOE-based model outperforms l1 in this classification task as well, giving better precision for all recall values. In the above experiments, the parameters for the l1 prior (λ), the MOE model (λ_MOE) and the incoherence term (μ) were all adjusted by cross-validation. The only exception is the MOE parameter β, which was chosen based on the fitting experiment as β = 0.07.

VI. CONCLUDING REMARKS

A framework for designing sparse modeling priors was introduced in this work, using tools from universal coding, which formalizes sparse coding and modeling from an MDL perspective. The priors obtained lead to models with both theoretical and practical advantages over the traditional l0- and l1-based ones. In all derived cases, the designed non-convex problems are suitable to be efficiently (approximately) solved via a few iterations of (weighted) l1 subproblems. We also showed that these priors are able to fit the empirical distribution of sparse codes of image patches significantly better than the traditional IID Laplacian model, and even than the non-identically distributed independent Laplacian model where a different Laplacian parameter is adjusted to the coefficients associated with each atom, thus showing the flexibility and accuracy of the proposed models.
The additional flexibility, furthermore, comes at the small cost of only 2 parameters that can be easily and efficiently tuned (either (κ, β) in the MOE model, or (θ1, θ2) in the JOE model), instead of K (the dictionary size), as in weighted l1 models. The additional accuracy of the proposed models was shown to have significant practical impact in active set recovery of sparse signals, image denoising, and classification applications. Compared to the Bayesian approach, we avoid the potential burden of solving several sampled sparse problems, or being forced to use a conjugate prior for computational reasons (although in our case, a fortiori, the conjugate prior does provide us with a good model). Overall, as demonstrated in this paper, the introduction of information theory tools can lead to formally addressing critical aspects of sparse modeling.

Future work in this direction includes the design of priors that take into account the nonzero mass at a = 0 that appears in overcomplete models, and the online learning of the model parameters from noisy data, following for example the technique in [3].

ACKNOWLEDGMENTS

Work partially supported by NGA, ONR, ARO, NSF, NSSEFF, and FUNDACIBA-ANTEL. We wish to thank Julien Mairal for providing us with his fast sparse modeling toolbox, SPAMS.¹³ We also thank Federico Lecumberry for his participation in the incoherent dictionary learning method, and for helpful comments.

APPENDIX

DERIVATION OF THE MOE MODEL

In this case we have P(a|θ) = θe^{−θa} and w(θ|κ, β) = (1/Γ(κ)) θ^{κ−1} β^κ e^{−βθ}, which, when plugged into (9), gives

Q(a|β, κ) = ∫_{θ=0}^∞ θe^{−θa} (1/Γ(κ)) θ^{κ−1} β^κ e^{−βθ} dθ = (β^κ/Γ(κ)) ∫_{θ=0}^∞ e^{−θ(a+β)} θ^κ dθ.

After the change of variables u := (a + β)θ (u(0) = 0, u(∞) = ∞), the integral can be written as

Q(a|β, κ) = (β^κ/Γ(κ)) ∫_{u=0}^∞ (u/(a+β))^κ e^{−u} du/(a+β) = (β^κ/Γ(κ)) (a+β)^{−(κ+1)} ∫_{u=0}^∞ u^κ e^{−u} du = (β^κ/Γ(κ)) (a+β)^{−(κ+1)} Γ(κ+1) = (β^κ/Γ(κ)) (a+β)^{−(κ+1)} κΓ(κ),

obtaining Q(a|β, κ) = κβ^κ(a + β)^{−(κ+1)}, since the integral on the second line is precisely the definition of Γ(κ + 1). The symmetrization is obtained by substituting a by |a| and dividing the normalization constant by two, Q(a|β, κ) = 0.5κβ^κ(|a| + β)^{−(κ+1)}. The mean of the one-sided MOE distribution (which is defined only for κ > 1) can be easily computed using integration by parts,

μ(β, κ) = ∫_0^∞ u κβ^κ (u + β)^{−(κ+1)} du = β^κ [−u(u + β)^{−κ}]_0^∞ + β^κ ∫_0^∞ (u + β)^{−κ} du = β/(κ − 1).

In the same way, it is easy to see that the non-central moment of order i is μ_i = β^i / C(κ−1, i), where C(κ−1, i) := (κ−1)(κ−2)···(κ−i)/i!. The MLE estimates of κ and β can be obtained using any nonlinear optimization technique such as Newton's method, using for example the estimates obtained with the method of moments as a starting point. In practice, however, we have not observed any significant improvement of the MLE estimates over the moments-based ones.

Expected approximation error in the cost function

As mentioned in the optimization section, the LLA approximates the MOE regularizer as a weighted l1.
Here we develop an expression for the expected error between the true regularizer and its convex approximation, where the expectation is taken (naturally) with respect to the MOE distribution. Given the value of the current iterate a^{(t)} = a0 (assumed positive, since the function and its approximation are symmetric), the approximated regularizer is ψ̃^{(t)}(a) = log(a0 + β) + (a − a0)/(a0 + β). We have

E_{a∼MOE(κ,β)}[ψ̃^{(t)}(a) − ψ(a)] = ∫_0^∞ [log(a0 + β) + (a − a0)/(a0 + β) − log(a + β)] κβ^κ (a + β)^{−(κ+1)} da = log(a0 + β) + (β/(κ−1) − a0)/(a0 + β) − ∫_0^∞ log(a + β) κβ^κ (a + β)^{−(κ+1)} da = log(a0 + β) + (β/(κ−1) − a0)/(a0 + β) − log β − 1/κ,

where we used E[a] = β/(κ − 1) and the fact that the last integral evaluates to log β + 1/κ.
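The closed form Q(a|β, κ) = κβ^κ(a + β)^{−(κ+1)} derived in this appendix can be verified numerically by integrating the exponential-Gamma mixture directly (our sketch, using a simple midpoint rule):

```python
import numpy as np
from math import gamma

def q_moe_closed(a, beta, kappa):
    # One-sided MOE density: kappa * beta^kappa * (a + beta)^(-(kappa+1))
    return kappa * beta**kappa * (a + beta) ** (-(kappa + 1.0))

def q_moe_numeric(a, beta, kappa, tmax=60.0, n=600000):
    # Midpoint-rule approximation of the mixture integral
    # int_0^inf theta * exp(-theta*a) * Gamma(theta; kappa, beta) dtheta
    dt = tmax / n
    theta = (np.arange(n) + 0.5) * dt
    prior = theta**(kappa - 1.0) * beta**kappa * np.exp(-beta * theta) / gamma(kappa)
    return float(np.sum(theta * np.exp(-theta * a) * prior) * dt)

print(q_moe_closed(2.0, 1.0, 3.0))   # 3/81 ~= 0.037037
print(q_moe_numeric(2.0, 1.0, 3.0))  # matches closely
```

Agreement between the two confirms, for a sample point, the change-of-variables computation carried out above.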

DERIVATION OF THE CONSTRAINED JEFFREYS (JOE) MODEL

In the case of the exponential distribution, the Fisher information (15) evaluates to

I(θ) = E_{P(a|θ)}[−∂²/∂θ² (log θ − θa)] = E_{P(a|θ)}[1/θ²] = 1/θ².

By plugging this result into (14) with Θ = [θ1, θ2], 0 < θ1 < θ2 < ∞, we obtain w(θ) = (1/ln(θ2/θ1))(1/θ). We now derive the (one-sided) JOE probability density function by plugging this w(θ) into (9),

Q(a) = ∫_{θ1}^{θ2} θe^{−θa} (1/(θ ln(θ2/θ1))) dθ = (1/ln(θ2/θ1)) ∫_{θ=θ1}^{θ2} e^{−θa} dθ = (1/(a ln(θ2/θ1))) (e^{−θ1a} − e^{−θ2a}).

Although Q(a) cannot be evaluated at a = 0, its limit as a → 0 exists and is finite, so we can simply define Q(0) as this limit, which is

lim_{a→0} Q(a) = lim_{a→0} (1/(a ln(θ2/θ1))) [1 − θ1a + o(a²) − (1 − θ2a + o(a²))] = (θ2 − θ1)/ln(θ2/θ1).

Again, if desired, parameter estimation can be done using, for example, maximum likelihood (via nonlinear optimization), or the method of moments. However, in this case the method of moments does not provide a closed form solution for (θ1, θ2). The non-central moment of order i is

μ_i = ∫_0^∞ a^i (1/(a ln(θ2/θ1))) [e^{−θ1a} − e^{−θ2a}] da = (1/ln(θ2/θ1)) {∫_0^∞ a^{i−1} e^{−θ1a} da − ∫_0^∞ a^{i−1} e^{−θ2a} da}. (22)

For i = 1, both integrals in (22) are trivially evaluated, yielding μ1 = (1/ln(θ2/θ1))(1/θ1 − 1/θ2). For i > 1, these integrals can be solved using integration by parts:

μ_i^+ = ∫_0^∞ a^{i−1} e^{−θ1a} da = [−a^{i−1} e^{−θ1a}/θ1]_0^∞ + ((i−1)/θ1) ∫_0^∞ a^{i−2} e^{−θ1a} da,
μ_i^− = ∫_0^∞ a^{i−1} e^{−θ2a} da = [−a^{i−1} e^{−θ2a}/θ2]_0^∞ + ((i−1)/θ2) ∫_0^∞ a^{i−2} e^{−θ2a} da,

where the first term on the right hand side of both equations evaluates to 0 for i > 1. Therefore, for i > 1 we obtain the recursions μ_i^+ = ((i−1)/θ1)μ_{i−1}^+ and μ_i^− = ((i−1)/θ2)μ_{i−1}^−, which, combined with the result for i = 1, give the final expression for all the moments of order i ≥ 1,

μ_i = ((i−1)!/ln(θ2/θ1)) (1/θ1^i − 1/θ2^i), i = 1, 2,....

In particular, for i = 1 and i = 2 we have 1/θ1 − 1/θ2 = ln(θ2/θ1)μ1 and 1/θ1 + 1/θ2 = μ2/μ1, which, when combined, give us
(23)

One possibility is to solve the nonlinear equation $\theta_2/\theta_1 = \frac{\mu_2 + \ln(\theta_2/\theta_1)\mu_1^2}{\mu_2 - \ln(\theta_2/\theta_1)\mu_1^2}$ for $u = \theta_2/\theta_1$ by finding the roots of the nonlinear equation $u = \frac{\mu_2 + \mu_1^2 \ln u}{\mu_2 - \mu_1^2 \ln u}$ and choosing one of them based on some side information. Another possibility is to simply fix the ratio $\theta_2/\theta_1$ beforehand and solve for $\theta_1$ and $\theta_2$ using (23).

DERIVATION OF THE CONDITIONAL JEFFREYS (CMOE) MODEL

The conditional Jeffreys method defines a proper prior $w(\theta)$ by assuming that $n$ samples from the data to be modeled, $a^n = (a_1, \ldots, a_n)$, were already observed. Plugging the Fisher information of the exponential distribution, $I(\theta) = \theta^{-2}$, into (18) we obtain
$$
w(\theta) = \frac{P(a^n|\theta)\sqrt{I(\theta)}}{\int_{\Theta} P(a^n|\xi)\sqrt{I(\xi)}\, d\xi}
= \frac{\left( \prod_{j=1}^{n} \theta e^{-\theta a_j} \right) \theta^{-1}}{\int_0^{\infty} \left( \prod_{j=1}^{n} \xi e^{-\xi a_j} \right) \xi^{-1}\, d\xi}
= \frac{\theta^{n-1} e^{-\theta \sum_{j=1}^{n} a_j}}{\int_0^{\infty} \xi^{n-1} e^{-\xi \sum_{j=1}^{n} a_j}\, d\xi}.
$$
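The root-finding route can be sketched as follows (a hypothetical moment-matching routine, not from the paper; bisection is just one of several ways to locate the root). Clearing the denominator turns the fixed-point equation into f(u) = (u−1)μ₂ − (u+1)μ₁² ln u = 0, which has a trivial root at u = 1 and the desired root at u = θ₂/θ₁ > 1:

```python
import math

def joe_moments(t1, t2):
    """Closed-form mu_1 and mu_2 of the one-sided JOE model."""
    L = math.log(t2 / t1)
    return (1 / t1 - 1 / t2) / L, (1 / t1**2 - 1 / t2**2) / L

def recover_joe_params(mu1, mu2, hi=1e6, iters=200):
    """Recover (t1, t2) from (mu1, mu2): bisect f(u) = (u-1)mu2 - (u+1)mu1^2 ln u
    for the root u = t2/t1 > 1, then back out t1 and t2 via (23)."""
    c = mu1 * mu1
    f = lambda u: (u - 1.0) * mu2 - (u + 1.0) * c * math.log(u)
    lo = 1.0 + 1e-9  # f > 0 just above the trivial root u = 1 (mu2 > 2 mu1^2 by Jensen)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    u = 0.5 * (lo + hi)
    t1 = 2.0 * mu1 / (mu2 + c * math.log(u))
    t2 = 2.0 * mu1 / (mu2 - c * math.log(u))
    return t1, t2

mu1, mu2 = joe_moments(0.5, 4.0)
t1, t2 = recover_joe_params(mu1, mu2)
print(t1, t2)  # should recover 0.5 and 4.0
```

A plain fixed-point iteration on u = (μ₂ + μ₁² ln u)/(μ₂ − μ₁² ln u) can diverge (the map's derivative exceeds one at the root for some parameter values), which is why a bracketing method is used here.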

Denoting $S = \sum_{j=1}^{n} a_j$ and performing the change of variables $u := S\xi$ in the denominator, we obtain
$$
w(\theta) = \frac{S^{n} \theta^{n-1} e^{-S\theta}}{\int_0^{\infty} u^{n-1} e^{-u}\, du} = \frac{S^{n} \theta^{n-1} e^{-S\theta}}{\Gamma(n)},
$$
where the last equality derives from the definition of the Gamma function, $\Gamma(n)$. We see that the resulting prior $w(\theta)$ is a Gamma distribution $Gamma(\bar\kappa, \bar\beta)$ with $\bar\kappa = n$ and $\bar\beta = S = \sum_{j=1}^{n} a_j$.

REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Trans. SP, 54(11), Nov. 2006.
[2] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. IT, 44(6), 1998.
[3] J. Bernardo and A. Smith. Bayesian Theory. Wiley, 1994.
[4] A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34-81, Feb. 2009.
[5] E. J. Candès. Compressive sampling. Proc. of the International Congress of Mathematicians, 3, Aug. 2006.
[6] E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1 minimization. J. Fourier Anal. Appl., 14(5):877-905, Dec. 2008.
[7] R. Chartrand. Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data. In IEEE ISBI, June 2009.
[8] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[9] R. Coifman and M. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Trans. IT, 38, 1992.
[10] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 2nd edition, 2006.
[11] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. on Pure and Applied Mathematics, 57, 2004.
[12] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[13] M. Elad. Optimized projections for compressed-sensing. IEEE Trans. SP, 55(12), Dec. 2007.
[14] K. Engan, S. Aase, and J. Husoy. Multi-frame compression: Theory and design. Signal Processing, 80(10), Oct. 2000.
[15] M. Everingham, A. Zisserman, C. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results.
[16] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal Am. Stat. Assoc., 96(456), Dec. 2001.
[17] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. NIPS. MIT Press, Dec. 2001.
[18] S. Foucart and M. Lai. Sparsest solutions of underdetermined linear systems via lq-minimization for 0 < q <= 1. Applied and Computational Harmonic Analysis, 26(3):395-407, 2009.
[19] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. SP, 57(12), 2009.
[20] R. Giryes, Y. Eldar, and M. Elad. Automatic parameter setting for iterative shrinkage methods. In IEEE 25th Convention of Electronics and Electrical Engineers in Israel (IEEEI'08), Dec. 2008.
[21] P. Grünwald. The Minimum Description Length Principle. MIT Press, June 2007.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, Feb. 2009.
[23] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. SP, 56(6), 2008.
[24] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans. PAMI, 27(6), 2005.
[25] E. Lam and J. Goodman. A mathematical analysis of the DCT coefficient distributions for images. IEEE Trans. IP, 9(10), 2000.
[26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. NIPS, volume 21, Dec. 2009.
[27] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM MMS, 7(1), April 2008.
[28] S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Trans. SP, 41(12), 1993.
[29] N. Merhav and M. Feder. Universal prediction. IEEE Trans. IT, 44(6), Oct. 1998.
[30] G. Motta, E. Ordentlich, I. Ramirez, G. Seroussi, and M. Weinberger. The DUDE framework for grayscale image denoising. Technical report, HP Laboratories, 2009.
[31] P. Moulin and J. Liu. Analysis of multiresolution image denoising schemes using generalized-Gaussian and complexity priors. IEEE Trans. IT, April 1999.
[32] R. Neelamani, H. Choi, and R. Baraniuk. ForWaRD: Fourier-wavelet regularized deconvolution for ill-conditioned systems. IEEE Trans. SP, 52(2), 2004.
[33] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.
[34] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, June 2007.
[35] I. Ramirez, F. Lecumberry, and G. Sapiro. Universal priors for sparse modeling. In CAMSAP, Dec. 2009.
[36] I. Ramírez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, June 2010.
[37] J. Rissanen. Universal coding, information, prediction and estimation. IEEE Trans. IT, 30(4), July 1984.
[38] J. Rissanen. Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.
[39] R. Saab, R. Chartrand, and O. Yilmaz. Stable sparse approximation via nonconvex optimization. In ICASSP, April 2008.
[40] N. Saito. Simultaneous noise suppression and signal compression using a library of orthonormal bases and the MDL criterion. In E. Foufoula-Georgiou and P. Kumar, editors, Wavelets in Geophysics. New York: Academic, 1994.
[41] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2), 1978.
[42] Y. Shtarkov. Universal sequential coding of single messages. Probl. Inform. Transm., 23(3):3-17, July 1987.
[43] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society: Series B, 58(1), 1996.
[44] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 2001.
[45] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. IT, 50(10), Oct. 2004.
[46] J. Trzasko and A. Manduca. Highly undersampled magnetic resonance image reconstruction via homotopic l0-minimization. IEEE Trans. MI, 28(1):106-121, Jan. 2009.
[47] J. Trzasko and A. Manduca. Relaxed conditions for sparse signal recovery with general concave priors. IEEE Trans. SP, 57(11), 2009.
[48] D. Wipf, J. Palmer, and B. Rao. Perspectives on sparse Bayesian learning. In Adv. NIPS, Dec. 2003.
[49] D. Wipf and B. Rao. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans. SP, 55(7), 2007.
[50] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity. arXiv preprint.
[51] H. Zou. The adaptive LASSO and its oracle properties. Journal Am. Stat. Assoc., 101, 2006.
[52] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 2008.


More information

Physics 43 Homework Set 9 Chapter 40 Key

Physics 43 Homework Set 9 Chapter 40 Key Physics 43 Homework Set 9 Chpter 4 Key. The wve function for n electron tht is confined to x nm is. Find the normliztion constnt. b. Wht is the probbility of finding the electron in. nm-wide region t x

More information

VoIP for the Small Business

VoIP for the Small Business Reducing your telecommunictions costs VoIP (Voice over Internet Protocol) offers low cost lterntive to expensive trditionl phone services nd is rpidly becoming the communictions system of choice for smll

More information

belief Propgtion Lgorithm in Nd Pent Penta

belief Propgtion Lgorithm in Nd Pent Penta IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 9, NO. 3, MAY/JUNE 2012 375 Itertive Trust nd Reputtion Mngement Using Belief Propgtion Ermn Aydy, Student Member, IEEE, nd Frmrz Feri, Senior

More information

How To Study The Effects Of Music Composition On Children

How To Study The Effects Of Music Composition On Children C-crcs Cognitive - Counselling Reserch & Conference Services (eissn: 2301-2358) Volume I Effects of Music Composition Intervention on Elementry School Children b M. Hogenes, B. Vn Oers, R. F. W. Diekstr,

More information

P.3 Polynomials and Factoring. P.3 an 1. Polynomial STUDY TIP. Example 1 Writing Polynomials in Standard Form. What you should learn

P.3 Polynomials and Factoring. P.3 an 1. Polynomial STUDY TIP. Example 1 Writing Polynomials in Standard Form. What you should learn 33337_0P03.qp 2/27/06 24 9:3 AM Chpter P Pge 24 Prerequisites P.3 Polynomils nd Fctoring Wht you should lern Polynomils An lgeric epression is collection of vriles nd rel numers. The most common type of

More information

Module 2. Analysis of Statically Indeterminate Structures by the Matrix Force Method. Version 2 CE IIT, Kharagpur

Module 2. Analysis of Statically Indeterminate Structures by the Matrix Force Method. Version 2 CE IIT, Kharagpur Module Anlysis of Stticlly Indeterminte Structures by the Mtrix Force Method Version CE IIT, Khrgpur esson 9 The Force Method of Anlysis: Bems (Continued) Version CE IIT, Khrgpur Instructionl Objectives

More information

MATH 150 HOMEWORK 4 SOLUTIONS

MATH 150 HOMEWORK 4 SOLUTIONS MATH 150 HOMEWORK 4 SOLUTIONS Section 1.8 Show tht the product of two of the numbers 65 1000 8 2001 + 3 177, 79 1212 9 2399 + 2 2001, nd 24 4493 5 8192 + 7 1777 is nonnegtive. Is your proof constructive

More information

Small Businesses Decisions to Offer Health Insurance to Employees

Small Businesses Decisions to Offer Health Insurance to Employees Smll Businesses Decisions to Offer Helth Insurnce to Employees Ctherine McLughlin nd Adm Swinurn, June 2014 Employer-sponsored helth insurnce (ESI) is the dominnt source of coverge for nonelderly dults

More information

Contextualizing NSSE Effect Sizes: Empirical Analysis and Interpretation of Benchmark Comparisons

Contextualizing NSSE Effect Sizes: Empirical Analysis and Interpretation of Benchmark Comparisons Contextulizing NSSE Effect Sizes: Empiricl Anlysis nd Interprettion of Benchmrk Comprisons NSSE stff re frequently sked to help interpret effect sizes. Is.3 smll effect size? Is.5 relly lrge effect size?

More information

DlNBVRGH + Sickness Absence Monitoring Report. Executive of the Council. Purpose of report

DlNBVRGH + Sickness Absence Monitoring Report. Executive of the Council. Purpose of report DlNBVRGH + + THE CITY OF EDINBURGH COUNCIL Sickness Absence Monitoring Report Executive of the Council 8fh My 4 I.I...3 Purpose of report This report quntifies the mount of working time lost s result of

More information

Space Vector Pulse Width Modulation Based Induction Motor with V/F Control

Space Vector Pulse Width Modulation Based Induction Motor with V/F Control Interntionl Journl of Science nd Reserch (IJSR) Spce Vector Pulse Width Modultion Bsed Induction Motor with V/F Control Vikrmrjn Jmbulingm Electricl nd Electronics Engineering, VIT University, Indi Abstrct:

More information

How To Set Up A Network For Your Business

How To Set Up A Network For Your Business Why Network is n Essentil Productivity Tool for Any Smll Business TechAdvisory.org SME Reports sponsored by Effective technology is essentil for smll businesses looking to increse their productivity. Computer

More information

INTERCHANGING TWO LIMITS. Zoran Kadelburg and Milosav M. Marjanović

INTERCHANGING TWO LIMITS. Zoran Kadelburg and Milosav M. Marjanović THE TEACHING OF MATHEMATICS 2005, Vol. VIII, 1, pp. 15 29 INTERCHANGING TWO LIMITS Zorn Kdelburg nd Milosv M. Mrjnović This pper is dedicted to the memory of our illustrious professor of nlysis Slobodn

More information

Data replication in mobile computing

Data replication in mobile computing Technicl Report, My 2010 Dt repliction in mobile computing Bchelor s Thesis in Electricl Engineering Rodrigo Christovm Pmplon HALMSTAD UNIVERSITY, IDE SCHOOL OF INFORMATION SCIENCE, COMPUTER AND ELECTRICAL

More information

A Decision Theoretic Framework for Ranking using Implicit Feedback

A Decision Theoretic Framework for Ranking using Implicit Feedback A Decision Theoretic Frmework for Rnking using Implicit Feedbck Onno Zoeter Michel Tylor Ed Snelson John Guiver Nick Crswell Mrtin Szummer Microsoft Reserch Cmbridge 7 J J Thomson Avenue Cmbridge, United

More information

2 DIODE CLIPPING and CLAMPING CIRCUITS

2 DIODE CLIPPING and CLAMPING CIRCUITS 2 DIODE CLIPPING nd CLAMPING CIRCUITS 2.1 Ojectives Understnding the operting principle of diode clipping circuit Understnding the operting principle of clmping circuit Understnding the wveform chnge of

More information

piecewise Liner SLAs and Performance Timetagment

piecewise Liner SLAs and Performance Timetagment i: Incrementl Cost bsed Scheduling under Piecewise Liner SLAs Yun Chi NEC Lbortories Americ 18 N. Wolfe Rd., SW3 35 Cupertino, CA 9514, USA ychi@sv.nec lbs.com Hyun Jin Moon NEC Lbortories Americ 18 N.

More information

Algebra Review. How well do you remember your algebra?

Algebra Review. How well do you remember your algebra? Algebr Review How well do you remember your lgebr? 1 The Order of Opertions Wht do we men when we write + 4? If we multiply we get 6 nd dding 4 gives 10. But, if we dd + 4 = 7 first, then multiply by then

More information

Solving BAMO Problems

Solving BAMO Problems Solving BAMO Problems Tom Dvis tomrdvis@erthlink.net http://www.geometer.org/mthcircles Februry 20, 2000 Abstrct Strtegies for solving problems in the BAMO contest (the By Are Mthemticl Olympid). Only

More information

QUADRATURE METHODS. July 19, 2011. Kenneth L. Judd. Hoover Institution

QUADRATURE METHODS. July 19, 2011. Kenneth L. Judd. Hoover Institution QUADRATURE METHODS Kenneth L. Judd Hoover Institution July 19, 2011 1 Integrtion Most integrls cnnot be evluted nlyticlly Integrls frequently rise in economics Expected utility Discounted utility nd profits

More information

Performance analysis model for big data applications in cloud computing

Performance analysis model for big data applications in cloud computing Butist Villlpndo et l. Journl of Cloud Computing: Advnces, Systems nd Applictions 2014, 3:19 RESEARCH Performnce nlysis model for big dt pplictions in cloud computing Luis Edurdo Butist Villlpndo 1,2,

More information

Clipping & Scan Conversion. CSE167: Computer Graphics Instructor: Steve Rotenberg UCSD, Fall 2005

Clipping & Scan Conversion. CSE167: Computer Graphics Instructor: Steve Rotenberg UCSD, Fall 2005 Clipping & Scn Conersion CSE167: Computer Grphics Instructor: Stee Rotenberg UCSD, Fll 2005 Project 2 Render 3D hnd (mde up of indiidul boxes) using hierrchicl trnsformtions (push/pop) The hnd should perform

More information

Review Problems for the Final of Math 121, Fall 2014

Review Problems for the Final of Math 121, Fall 2014 Review Problems for the Finl of Mth, Fll The following is collection of vrious types of smple problems covering sections.,.5, nd.7 6.6 of the text which constitute only prt of the common Mth Finl. Since

More information

Fast Demand Learning for Display Advertising Revenue Management

Fast Demand Learning for Display Advertising Revenue Management Fst Demnd Lerning for Disply Advertising Revenue Mngement Drgos Florin Ciocn Vivek F Fris April 30, 2014 Abstrct The present pper is motivted by the network revenue mngement problems tht occur in online

More information