Online Dictionary Learning for Sparse Coding




Julien Mairal JULIEN.MAIRAL@INRIA.FR
Francis Bach FRANCIS.BACH@INRIA.FR
INRIA, 45 rue d'Ulm, 75005 Paris, France

Jean Ponce JEAN.PONCE@ENS.FR
Ecole Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France

Guillermo Sapiro GUILLE@UMN.EDU
University of Minnesota, Department of Electrical and Computer Engineering, 200 Union Street SE, Minneapolis, USA

Abstract

Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains. This paper proposes a new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples. A proof of convergence is presented, along with experiments with natural images demonstrating that it leads to faster performance and better dictionaries than classical batch algorithms for both small and large datasets.

1. Introduction

The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one, based on wavelets (Mallat, 1999) for example, has recently led to state-of-the-art results for numerous low-level image processing tasks such as denoising (Elad & Aharon, 2006) as well as higher-level tasks such as classification (Raina et al., 2007; Mairal et al., 2009), showing that sparse learned models are well adapted to natural signals. Unlike decompositions based on principal component analysis and its variants, these models do not impose that the basis vectors be orthogonal, allowing more flexibility to adapt the representation to the data. While learning the dictionary has proven to be critical to achieve (or improve upon) state-of-the-art results, effectively solving the corresponding optimization problem is a significant computational challenge, particularly in the context of the large-scale datasets involved in image processing tasks, that may include millions of training samples. Addressing this challenge is the topic of this paper.

WILLOW Project, Laboratoire d'Informatique de l'Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Concretely, consider a signal x in R^m. We say that it admits a sparse approximation over a dictionary D in R^{m×k}, with k columns referred to as atoms, when one can find a linear combination of a "few" atoms from D that is "close" to the signal x. Experiments have shown that modelling a signal with such a sparse decomposition (sparse coding) is very effective in many signal processing applications (Chen et al., 1999). For natural images, predefined dictionaries based on various types of wavelets (Mallat, 1999) have been used for this task. However, learning the dictionary instead of using off-the-shelf bases has been shown to dramatically improve signal reconstruction (Elad & Aharon, 2006). Although some of the learned dictionary elements may sometimes look like wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice.

Most recent algorithms for dictionary learning (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) are second-order iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints. Although they have shown experimentally to be much faster than first-order gradient descent methods (Lee et al., 2007), they cannot effectively handle very large training sets (Bottou & Bousquet, 2008), or dynamic training data changing over time, such as video sequences. To address these issues, we propose an online approach that processes one element (or a small subset) of the training set at a time. This is particularly important in the context of image and video processing (Protter & Elad, 2009), where it is common to learn dictionaries adapted to small patches, with training data that may include several millions of these patches (roughly one per pixel and per frame). In this setting, online techniques based on stochastic approximations are an attractive alternative to batch methods (Bottou, 1998). For example, first-order stochastic gradient descent with projections on the constraint set is sometimes used for dictionary learning (see Aharon and Elad (2008) for instance). We show in this paper that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to the problem of dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms and without the need of explicit learning rate tuning. As demonstrated by our experiments, the algorithm scales up gracefully to large datasets with millions of training samples, and it is usually faster than more standard methods.

1.1. Contributions

This paper makes three main contributions.
- We cast in Section 2 the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set, minimizing the (desired) expected cost when the training set size goes to infinity.
- We propose in Section 3 an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints. This method is shown in Section 4 to converge with probability one to a stationary point of the cost function.
- As shown experimentally in Section 5, our algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images.
To demonstrate that it is adapted to difficult, large-scale image-processing tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting.

2. Problem Statement

Classical dictionary learning techniques (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) consider a finite training set of signals X = [x_1, ..., x_n] in R^{m×n} and optimize the empirical cost function

    f_n(D) \triangleq \frac{1}{n} \sum_{i=1}^{n} l(x_i, D),    (1)

where D in R^{m×k} is the dictionary, each column representing a basis vector, and l is a loss function such that l(x, D) should be small if D is "good" at representing the signal x. The number of samples n is usually large, whereas the signal dimension m is relatively small, for example, m = 100 for 10×10 image patches, and n ≥ 100,000 for typical image processing applications. In general, we also have k ≪ n (e.g., k = 200 for n = 100,000), and each signal only uses a few elements of D in its representation. Note that, in this setting, overcomplete dictionaries with k > m are allowed. As others (see (Lee et al., 2007) for example), we define l(x, D) as the optimal value of the ℓ1-sparse coding problem:

    l(x, D) \triangleq \min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1,    (2)

where λ is a regularization parameter.² This problem is also known as basis pursuit (Chen et al., 1999), or the Lasso (Tibshirani, 1996). It is well known that the ℓ1 penalty yields a sparse solution for α, but there is no analytic link between the value of λ and the corresponding effective sparsity ‖α‖_0. To prevent D from being arbitrarily large (which would lead to arbitrarily small values of α), it is common to constrain its columns (d_j)_{j=1}^k to have an ℓ2 norm less than or equal to one. We will call C the convex set of matrices verifying this constraint:

    C \triangleq \{D \in \mathbb{R}^{m \times k} \text{ s.t. } \forall j = 1, \ldots, k, \; d_j^T d_j \leq 1\}.    (3)

Note that the problem of minimizing the empirical cost f_n(D) is not convex with respect to D. It can be rewritten as a joint optimization problem with respect to the dictionary D and the coefficients α = [α_1, ..., α_n] of the sparse decomposition, which is not jointly convex, but convex with respect to each of the two variables D and α when the other one is fixed:

    \min_{D \in C, \, \alpha \in \mathbb{R}^{k \times n}} \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big).    (4)

A natural approach to solving this problem is to alternate between the two variables, minimizing over one while keeping the other one fixed, as proposed by Lee et al. (2007) (see also Aharon et al. (2006), who use ℓ0 rather than ℓ1 penalties, for related approaches).³ Since the computation of α dominates the cost of each iteration, a second-order optimization technique can be used in this case to accurately estimate D at each step when α is fixed. As pointed out by Bottou and Bousquet (2008), however, one is usually not interested in a perfect minimization of

² The ℓ_p norm of a vector x in R^m is defined, for p ≥ 1, by ‖x‖_p = (Σ_{i=1}^m |x[i]|^p)^{1/p}. Following tradition, we denote by ‖x‖_0 the number of nonzero elements of the vector x. This ℓ0 sparsity measure is not a true norm.
³ In our setting, as in (Lee et al., 2007), we use the convex ℓ1 norm, which has empirically proven to be better behaved in general than the ℓ0 pseudo-norm for dictionary learning.

the empirical cost f_n(D), but in the minimization of the expected cost

    f(D) \triangleq \mathbb{E}_x[l(x, D)] = \lim_{n \to \infty} f_n(D) \quad \text{a.s.},    (5)

where the expectation (which is assumed finite) is taken relative to the (unknown) probability distribution p(x) of the data.⁴ In particular, given a finite training set, one should not spend too much effort on accurately minimizing the empirical cost, since it is only an approximation of the expected cost. Bottou and Bousquet (2008) have further shown both theoretically and experimentally that stochastic gradient algorithms, whose rate of convergence is not good in conventional optimization terms, may in fact in certain settings be the fastest in reaching a solution with low expected cost. With large training sets, classical batch optimization techniques may indeed become impractical in terms of speed or memory requirements.

⁴ We use "a.s." (almost sure) to denote convergence with probability one.

In the case of dictionary learning, classical projected first-order stochastic gradient descent (as used by Aharon and Elad (2008) for instance) consists of a sequence of updates of D:

    D_t = \Pi_C \big[ D_{t-1} - \rho_t \nabla_D l(x_t, D_{t-1}) \big],    (6)

where ρ_t is the gradient step, Π_C is the orthogonal projector on C, and the training set x_1, x_2, ... are i.i.d. samples of the (unknown) distribution p(x). As shown in Section 5, we have observed that this method can be competitive compared to batch methods with large training sets, when a good learning rate ρ is selected.

The dictionary learning method we present in the next section falls into the class of online algorithms based on stochastic approximations, processing one sample at a time, but exploits the specific structure of the problem to efficiently solve it. Contrary to classical first-order stochastic gradient descent, it does not require explicit learning rate tuning and minimizes a sequence of quadratic local approximations of the expected cost.

3. Online Dictionary Learning

We present in this section the basic components of our online algorithm for dictionary learning (Sections 3.1-3.3), as well as two minor variants which speed up our implementation (Section 3.4).

3.1. Algorithm Outline

Our algorithm is summarized in Algorithm 1. Assuming the training set is composed of i.i.d. samples of a distribution p(x), its inner loop draws one element x_t at a time, as in stochastic gradient descent, and alternates classical sparse coding steps for computing the decomposition α_t of x_t over the dictionary D_{t-1} obtained at the previous iteration, with dictionary update steps where the new dictionary D_t is computed by minimizing over C the function

    \hat{f}_t(D) \triangleq \frac{1}{t} \sum_{i=1}^{t} \Big( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big),    (7)

where the vectors α_i are computed during the previous steps of the algorithm. The motivation behind our approach is twofold:
- The quadratic function f̂_t aggregates the past information computed during the previous steps of the algorithm, namely the vectors α_i, and it is easy to show that it upper-bounds the empirical cost f_t(D_t) from Eq. (1). One key aspect of the convergence analysis will be to show that f̂_t(D_t) and f_t(D_t) converge almost surely to the same limit and thus f̂_t acts as a surrogate for f_t.
- Since f̂_t is close to f̂_{t-1}, D_t can be obtained efficiently using D_{t-1} as warm restart.

Algorithm 1: Online dictionary learning.
Require: x ∈ R^m ~ p(x) (random variable and an algorithm to draw i.i.d. samples of p), λ ∈ R (regularization parameter), D_0 ∈ R^{m×k} (initial dictionary), T (number of iterations).
1: A_0 ← 0, B_0 ← 0 (reset the "past" information).
2: for t = 1 to T do
3:   Draw x_t from p(x).
4:   Sparse coding: compute using LARS

        \alpha_t \triangleq \arg\min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \|\alpha\|_1.    (8)

5:   A_t ← A_{t-1} + α_t α_t^T.
6:   B_t ← B_{t-1} + x_t α_t^T.
7:   Compute D_t using Algorithm 2, with D_{t-1} as warm restart, so that

        D_t \triangleq \arg\min_{D \in C} \frac{1}{t} \sum_{i=1}^{t} \Big( \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big)
            = \arg\min_{D \in C} \frac{1}{t} \Big( \frac{1}{2} \mathrm{Tr}(D^T D A_t) - \mathrm{Tr}(D^T B_t) \Big).    (9)

8: end for
9: Return D_T (learned dictionary).

3.2. Sparse Coding

The sparse coding problem of Eq. (2) with fixed dictionary is an ℓ1-regularized linear least-squares problem. A number of recent methods for solving this type of problems are based on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007). When the columns of the dictionary have low correlation, these simple methods have proven to be very efficient. However, the columns of learned dictionaries are in general highly correlated, and we have empirically observed that a Cholesky-based implementation of the LARS-Lasso algorithm, a homotopy method (Osborne et al., 2000; Efron et al., 2004) that provides the whole regularization path, that is, the solutions for all possible values of λ, can be as fast as approaches based on soft thresholding, while providing the solution with a higher accuracy.

3.3. Dictionary Update

Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts, and one of its main advantages is that it is parameter-free and does not require any learning rate tuning, which can be difficult in a constrained optimization setting. Concretely, Algorithm 2 sequentially updates each column of D. Using some simple algebra, it is easy to show that Eq. (10) gives the solution of the dictionary update (9) with respect to the j-th column d_j, while keeping the other ones fixed under the constraint d_j^T d_j ≤ 1. Since this convex optimization problem admits separable constraints in the updated blocks (columns), convergence to a global optimum is guaranteed (Bertsekas, 1999). In practice, since the vectors α_i are sparse, the coefficients of the matrix A are in general concentrated on the diagonal, which makes the block-coordinate descent more efficient.

Algorithm 2: Dictionary Update.
Require: D = [d_1, ..., d_k] ∈ R^{m×k} (input dictionary), A = [a_1, ..., a_k] ∈ R^{k×k} = Σ_{i=1}^t α_i α_i^T, B = [b_1, ..., b_k] ∈ R^{m×k} = Σ_{i=1}^t x_i α_i^T.
1: repeat
2:   for j = 1 to k do
3:     Update the j-th column to optimize for (9):

          u_j \leftarrow \frac{1}{A_{jj}} (b_j - D a_j) + d_j, \qquad
          d_j \leftarrow \frac{1}{\max(\|u_j\|_2, 1)} u_j.    (10)

4:   end for
5: until convergence
6: Return D (updated dictionary).
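As a concrete illustration of how Algorithms 1 and 2 fit together, here is a minimal NumPy sketch of one pass of online dictionary learning. It is a simplified stand-in, not the authors' C++ implementation: the LARS-Lasso solver of Section 3.2 is replaced by a basic ISTA (proximal gradient) loop, a single block-coordinate sweep is used per dictionary update, and all function names are ours.

```python
import numpy as np

def sparse_code(x, D, lam, n_iter=100):
    """Solve Eq. (2), min_a 0.5*||x - D a||_2^2 + lam*||a||_1, by ISTA.
    (The paper uses LARS; ISTA is a compact stand-in for illustration.)"""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L                           # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return a

def dictionary_update(D, A, B):
    """Algorithm 2: one block-coordinate sweep over the columns of D, Eq. (10)."""
    for j in range(D.shape[1]):
        if A[j, j] < 1e-12:
            continue                          # atom never used yet; leave as is
        u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
        D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project onto ||d_j||_2 <= 1
    return D

def online_dictionary_learning(X, k, lam, seed=0):
    """Algorithm 1: one pass over the columns of X (the training signals)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = X[:, rng.choice(n, size=k, replace=False)].astype(float)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)    # initial D in the set C
    A = np.zeros((k, k))                               # A_t = sum alpha alpha^T
    B = np.zeros((m, k))                               # B_t = sum x alpha^T
    for t in rng.permutation(n):
        x = X[:, t]
        a = sparse_code(x, D, lam)
        A += np.outer(a, a)
        B += np.outer(x, a)
        D = dictionary_update(D, A, B)   # warm restart: current D, single sweep
    return D
```

The single sweep per step mirrors the warm-restart observation of Section 3.3, and every returned column stays inside the constraint set C by construction.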
⁵ Note that this assumption does not exactly hold: to be more exact, if a group of columns in D are highly correlated, the coefficients of the matrix A can concentrate on the corresponding principal submatrices of A.

Since our algorithm uses the value of D_{t-1} as a warm restart for computing D_t, a single iteration has empirically been found to be enough. Other approaches have been proposed to update D, for instance, Lee et al. (2007) suggest using a Newton method on the dual of Eq. (9), but this requires inverting a k×k matrix at each Newton iteration, which is impractical for an online algorithm.

3.4. Optimizing the Algorithm

We have presented so far the basic building blocks of our algorithm. This section discusses simple improvements that significantly enhance its performance.

Handling Fixed-Size Datasets. In practice, although it may be very large, the size of the training set is often finite (of course this may not be the case, when the data consists of a video stream that must be treated on the fly for example). In this situation, the same data points may be examined several times, and it is very common in online algorithms to simulate an i.i.d. sampling of p(x) by cycling over a randomly permuted training set (Bottou & Bousquet, 2008). This method works experimentally well in our setting but, when the training set is small enough, it is possible to further speed up convergence: In Algorithm 1, the matrices A_t and B_t carry all the information from the past coefficients α_1, ..., α_t. Suppose that at time t_0, a signal x is drawn and the vector α_{t_0} is computed. If the same signal x is drawn again at time t > t_0, one would like to remove the "old" information concerning x from A_t and B_t, that is, write A_t ← A_{t-1} + α_t α_t^T − α_{t_0} α_{t_0}^T for instance. When dealing with large training sets, it is impossible to store all the past coefficients α_{t_0}, but it is still possible to partially exploit the same idea, by carrying in A_t and B_t the information from the current and previous epochs (cycles through the data) only.

Mini-Batch Extension. In practice, we can improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms. Let us denote by x_{t,1}, ..., x_{t,η} the signals drawn at iteration t. We can then replace lines 5 and 6 of Algorithm 1 by

    A_t \leftarrow \beta A_{t-1} + \sum_{i=1}^{\eta} \alpha_{t,i} \alpha_{t,i}^T, \qquad
    B_t \leftarrow \beta B_{t-1} + \sum_{i=1}^{\eta} x_{t,i} \alpha_{t,i}^T,    (11)

where β is chosen so that β = (θ + 1 − η)/(θ + 1), with θ = tη if t < η and θ = η² + t − η if t ≥ η, which is compatible with our convergence analysis.

Purging the Dictionary from Unused Atoms. Every dictionary learning technique sometimes encounters situations where some of the dictionary atoms are never (or very seldom) used, which happens typically with a very bad initialization. A common practice is to replace them during the

optimization by elements of the training set, which solves in practice this problem in most cases.

4. Convergence Analysis

Although our algorithm is relatively simple, its stochastic nature and the non-convexity of the objective function make the proof of its convergence to a stationary point somewhat involved. The main tools used in our proofs are the convergence of empirical processes (van der Vaart, 1998) and, following Bottou (1998), the convergence of quasi-martingales (Fisk, 1965). Our analysis is limited to the basic version of the algorithm, although it can in principle be carried over to the optimized version discussed in Section 3.4. Because of space limitations, we will restrict ourselves to the presentation of our main results and a sketch of their proofs, which will be presented in details elsewhere, and first the (reasonable) assumptions under which our analysis holds.

4.1. Assumptions

(A) The data admits a bounded probability density p with compact support K. Assuming a compact support for the data is natural in audio, image, and video processing applications, where it is imposed by the data acquisition process.

(B) The quadratic surrogate functions f̂_t are strictly convex with lower-bounded Hessians. We assume that the smallest eigenvalue of the positive semi-definite matrix (1/t)A_t defined in Algorithm 1 is greater than or equal to a non-zero constant κ₁ (making A_t invertible and f̂_t strictly convex with Hessian lower-bounded). This hypothesis is in practice verified experimentally after a few iterations of the algorithm when the initial dictionary is reasonable, consisting for example of a few elements from the training set, or any one of the "off-the-shelf" dictionaries, such as DCT (bases of cosine products) or wavelets. Note that it is easy to enforce this assumption by adding a term (κ₁/2)‖D‖²_F to the objective function, which is equivalent in practice to replacing the positive semi-definite matrix (1/t)A_t by (1/t)A_t + κ₁I. We have omitted for simplicity this penalization in our analysis.
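The regularizer just described, replacing (1/t)A_t by (1/t)A_t + κ₁I, is easy to sketch in code. The conditional check and the default value of κ₁ below are our own illustrative choices, not part of the paper's analysis (which omits the penalization altogether):

```python
import numpy as np

def enforce_assumption_b(A, t, kappa1=1e-3):
    """If the smallest eigenvalue of (1/t)*A falls below kappa1, add t*kappa1*I,
    i.e. replace (1/t)A_t by (1/t)A_t + kappa1*I. This corresponds to adding
    the penalty (kappa1/2)*||D||_F^2 to the surrogate objective."""
    if np.linalg.eigvalsh(A / t).min() < kappa1:
        A = A + t * kappa1 * np.eye(A.shape[0])
    return A
```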
(C) A sufficient uniqueness condition of the sparse coding solution is verified: given some x ∈ K, where K is the support of p, and D ∈ C, let us denote by Λ the set of indices j such that |d_j^T (x − Dα*)| = λ, where α* is the solution of Eq. (2). We assume that there exists κ₂ > 0 such that, for all x in K and all dictionaries D in the subset S of C considered by our algorithm, the smallest eigenvalue of D_Λ^T D_Λ is greater than or equal to κ₂. This matrix is thus invertible and classical results (Fuchs, 2005) ensure the uniqueness of the sparse coding solution. It is of course easy to build a dictionary D for which this assumption fails. However, having D_Λ^T D_Λ invertible is a common assumption in linear regression and in methods such as the LARS algorithm aimed at solving Eq. (2) (Efron et al., 2004). It is also possible to enforce this condition using an elastic net penalization (Zou & Hastie, 2005), replacing ‖α‖₁ by ‖α‖₁ + (κ₂/2)‖α‖₂² and thus improving the numerical stability of homotopy algorithms such as LARS. Again, we have omitted this penalization for simplicity.

4.2. Main Results and Proof Sketches

Given assumptions (A) to (C), let us now show that our algorithm converges to a stationary point of the objective function.

Proposition 1 (convergence of f(D_t) and of the surrogate function). Let f̂_t denote the surrogate function defined in Eq. (7). Under assumptions (A) to (C): f̂_t(D_t) converges a.s.; f(D_t) − f̂_t(D_t) converges a.s. to 0; and f(D_t) converges a.s.

Proof sketch: The first step in the proof is to show that D_{t+1} − D_t = O(1/t) which, although it does not ensure the convergence of D_t, ensures the convergence of the series Σ_{t=1}^∞ ‖D_{t+1} − D_t‖²_F, a classical condition in gradient descent convergence proofs (Bertsekas, 1999). In turn, this reduces to showing that D_t minimizes a parametrized quadratic function over C with parameters (1/t)A_t and (1/t)B_t, then showing that the solution is uniformly Lipschitz with respect to these parameters, borrowing some ideas from perturbation theory (Bonnans & Shapiro, 1998).
At this point, and following Bottou (1998), proving the convergence of the sequence f̂_t(D_t) amounts to showing that the stochastic positive process

    u_t \triangleq \hat{f}_t(D_t) \geq 0,    (12)

is a quasi-martingale. To do so, denoting by F_t the filtration of the past information, a theorem by Fisk (1965) states that if the positive sum Σ_{t=1}^∞ E[max(E[u_{t+1} − u_t | F_t], 0)] converges, then u_t is a quasi-martingale which converges with probability one. Using some results on empirical processes (van der Vaart, 1998, Chap. 19.2, Donsker Theorem), we obtain a bound that ensures the convergence of this series. It follows from the convergence of u_t that f_t(D_t) − f̂_t(D_t) converges to zero with probability one. Then, a classical theorem from perturbation theory (Bonnans & Shapiro, 1998, Theorem 4.1) shows that l(x, ·) is C¹. This allows us to use a last result on empirical processes ensuring that f(D_t) − f̂_t(D_t) converges almost surely to 0. Therefore f(D_t) converges as well with probability one.

Proposition 2 (convergence to a stationary point). Under assumptions (A) to (C), D_t is asymptotically close to the set of stationary points of the dictionary learning problem with probability one.

Proof sketch: The first step in the proof is to show using classical analysis tools that, given assumptions (A) to (C), f is C¹ with a Lipschitz gradient. Considering Ã and B̃, two accumulation points of (1/t)A_t and (1/t)B_t respectively, we can define the corresponding surrogate function f̂_∞ such that for all D in C, f̂_∞(D) = ½ Tr(D^T D Ã) − Tr(D^T B̃), and its optimum D_∞ on C. The next step consists of showing that ∇f̂_∞(D_∞) = ∇f(D_∞) and that −∇f(D_∞) is in the normal cone of the set C, that is, D_∞ is a stationary point of the dictionary learning problem (Borwein & Lewis, 2006).

5. Experimental Validation

In this section, we present experiments on natural images to demonstrate the efficiency of our method.

5.1. Performance evaluation

For our experiments, we have randomly selected 1.25×10⁶ patches from images in the Berkeley segmentation dataset, which is a standard image database; 10⁶ of these are kept for training, and the rest for testing. We used these patches to create three datasets A, B, and C with increasing patch and dictionary sizes representing various typical settings in image processing applications:

    Data | Signal size m  | Number k of atoms | Type
    A    | 8×8 = 64       | 256               | b&w
    B    | 12×12×3 = 432  | 512               | color
    C    | 16×16 = 256    | 1024              | b&w

We have normalized the patches to have unit ℓ2-norm and used the regularization parameter λ = 1.2/√m in all of our experiments. The 1/√m term is a classical normalization factor (Bickel et al., 2007), and the constant 1.2 has been experimentally shown to yield reasonable sparsities (about 10 nonzero coefficients) in these experiments. We have implemented the proposed algorithm in C++ with a Matlab interface. All the results presented in this section use the mini-batch refinement from Section 3.4, since this has shown empirically to improve speed by a factor of 10 or more. This requires tuning the parameter η, the number of signals drawn at each iteration.
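The mini-batch refinement tuned here replaces lines 5 and 6 of Algorithm 1 with the rescaled update of Eq. (11). A minimal sketch of that update (the function name is ours):

```python
import numpy as np

def minibatch_accumulate(A, B, X_batch, alphas, t, eta):
    """Eq. (11): update the sufficient statistics A_t, B_t from a mini-batch.
    X_batch is (m, eta) with the eta signals drawn at iteration t (t >= 1);
    alphas is (k, eta) with their sparse codes over the current dictionary."""
    theta = t * eta if t < eta else eta ** 2 + t - eta
    beta = (theta + 1 - eta) / (theta + 1)   # down-weights the old statistics
    return beta * A + alphas @ alphas.T, beta * B + X_batch @ alphas.T
```

Note that for η = 1 the formula gives β = t/(t + 1), so the plain update of lines 5 and 6 is recovered only up to this gentle rescaling of the past.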
Trying different powers of 2 for this variable has shown that η = 256 was a good choice (lowest objective function values on the training set; empirically, this setting also yields the lowest values on the test set), but values of 128 and 512 have given very similar performances.

Our implementation can be used in both the online setting it is intended for, and in a regular batch mode where it uses the entire dataset at each iteration (corresponding to the mini-batch version with η = n). We have also implemented a first-order stochastic gradient descent algorithm that shares most of its code with our algorithm, except for the dictionary update step. This setting allows us to draw meaningful comparisons between our algorithm and its batch and stochastic gradient alternatives, which would have been difficult otherwise. For example, comparing our algorithm to the Matlab implementation of the batch approach from (Lee et al., 2007) developed by its authors would have been unfair since our C++ program has a built-in speed advantage. Although our implementation is multithreaded, our experiments have been run for simplicity on a single-CPU, single-core 2.4GHz machine. To measure and compare the performances of the three tested methods, we have plotted the value of the objective function on the test set, acting as a surrogate of the expected cost, as a function of the corresponding training time.

Online vs. Batch. Figure 1 (top) compares the online and batch settings of our implementation. The full training set consists of 10⁶ samples. The online version of our algorithm draws samples from the entire set, and we have run its batch version on the full dataset as well as subsets of size 10⁴ and 10⁵ (see Figure 1). The online setting systematically outperforms its batch counterpart for every training set size and desired precision. We use a logarithmic scale for the computation time, which shows that in many situations, the difference in performance can be dramatic. Similar experiments have given similar results on smaller datasets.

Comparison with Stochastic Gradient Descent.
Our experiments have shown that obtaining good performance with stochastic gradient descent requires using both the mini-batch heuristic and carefully choosing the learning rate ρ. To give the fairest comparison possible, we have thus optimized these parameters, sampling η values among powers of 2 (as before) and ρ values among powers of 10. The combination of values ρ = 10⁴, η = 512 gives the best results on the training and test data for stochastic gradient descent. Figure 1 (bottom) compares our method with stochastic gradient descent for different ρ values around 10⁴ and a fixed value of η = 512. We observe that the larger the value of ρ is, the better the eventual value of the objective function is after many iterations, but the longer it will take to achieve a good precision. Although our method performs better at such high-precision settings for dataset C, it appears that, in general, for a desired precision and a particular dataset, it is possible to tune the stochastic gradient descent algorithm to achieve a performance similar to that of our algorithm. Note that both stochastic gradient descent and our method only start decreasing the objective function value after a few iterations. Slightly better results could be obtained by using smaller gradient steps

during the first iterations, using a learning rate of the form ρ/(t + t_0) for the stochastic gradient descent, and initializing A_0 = t_0 I and B_0 = t_0 D_0 for the matrices A_t and B_t, where t_0 is a new parameter.

[Figure 1: panels for evaluation sets A, B, and C. Top: comparison between online and batch learning for various training set sizes (n = 10^4, 10^5, 10^6). Bottom: comparison between our method and stochastic gradient (SG) descent with different learning rates ρ (5·10^3, 10^4, 2·10^4). In both cases, the value of the objective function evaluated on the test set is reported as a function of computation time on a logarithmic scale. Values of the objective function greater than its initial value are truncated.]

5.2. Application to Inpainting

Our last experiment demonstrates that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 2. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7 × 10^6 undamaged 12 × 12 color patches in the image with two epochs in about 500 seconds on a 2.4GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008).
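To make the procedure above concrete, here is a minimal NumPy sketch of the mini-batch online learning loop, including the optional warm-start initialization A_0 = t_0 I, B_0 = t_0 D_0 mentioned in the text. This is not the authors' C++ implementation: all function names are illustrative, and the sparse-coding step uses a simple ISTA lasso solver as a stand-in for the LARS-based solver used in the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator for the lasso."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, X, lam, n_iter=50):
    """Lasso coefficients for the columns of X via ISTA (stand-in for LARS)."""
    L = max(np.linalg.norm(D, 2) ** 2, 1e-12)  # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
    return A

def online_dict_learning(X, D0, lam, eta, t0=0.0, n_epochs=1):
    """Mini-batch online dictionary learning; eta = n recovers batch mode.

    t0 > 0 applies the warm-start initialization A0 = t0*I, B0 = t0*D0
    that damps the dictionary updates during the first iterations.
    """
    m, n = X.shape
    k = D0.shape[1]
    D = D0.copy()
    A_acc = t0 * np.eye(k)        # "past information" matrices
    B_acc = t0 * D0.copy()
    for _ in range(n_epochs):
        for s in range(0, n, eta):
            Xt = X[:, s:s + eta]             # draw a mini-batch of eta samples
            At = sparse_code(D, Xt, lam)
            A_acc += At @ At.T
            B_acc += Xt @ At.T
            # Block coordinate descent on the columns of D.
            for j in range(k):
                if A_acc[j, j] > 1e-12:
                    u = D[:, j] + (B_acc[:, j] - D @ A_acc[:, j]) / A_acc[j, j]
                    D[:, j] = u / max(np.linalg.norm(u), 1.0)  # unit-ball projection
    return D
```

Setting `eta = X.shape[1]` makes each iteration sweep the entire dataset, which is the regular batch mode discussed above.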
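The inpainting step itself can be sketched as follows: each damaged patch is sparse-coded over the learned dictionary using only its observed pixels, and the missing pixels are filled in from the dictionary reconstruction. This is a simplified, hypothetical illustration of the masked sparse-coding idea, not the exact procedure of Mairal et al. (2008); the function names are ours and the lasso is again solved by ISTA.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inpaint_patch(D, x, mask, lam, n_iter=100):
    """Reconstruct a damaged patch x (mask[i] == True where pixel i is observed):
    sparse-code on the observed pixels only, then fill in the missing ones."""
    Dm = D[mask]                                # rows of D at observed pixels
    L = max(np.linalg.norm(Dm, 2) ** 2, 1e-12)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):                     # ISTA on the masked lasso
        g = Dm.T @ (Dm @ alpha - x[mask]) / L
        alpha = soft_threshold(alpha - g, lam / L)
    x_hat = D @ alpha                           # full reconstruction
    out = x.copy()
    out[~mask] = x_hat[~mask]                   # keep observed pixels as-is
    return out
```

In an actual pipeline this would be applied to every (overlapping) patch of the image, with the overlapping estimates averaged.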
Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard datasets. Instead, we just wish to demonstrate that the proposed method can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. For comparison, the dictionaries used for inpainting in the state-of-the-art method of Mairal et al. (2008) are learned (in batch mode) on only 200,000 patches.

6. Discussion

We have introduced in this paper a new stochastic online algorithm for learning dictionaries adapted to sparse coding tasks, and proven its convergence. Preliminary experiments demonstrate that it is significantly faster than batch alternatives on large datasets that may contain millions of training examples, yet it does not require learning rate tuning like regular stochastic gradient descent methods. More experiments are of course needed to better assess the promise of this approach in image restoration tasks such as denoising, deblurring, and inpainting. Beyond this, we plan to use the proposed learning framework for sparse coding in computationally demanding video restoration tasks (Protter & Elad, 2009), with dynamic datasets whose size is not fixed, and also plan to extend this framework to different loss functions to address discriminative tasks such as image classification (Mairal et al., 2009), which are more sensitive to overfitting than reconstructive ones, and various matrix factorization tasks, such as non-negative matrix factorization with sparseness constraints and sparse principal component analysis.

Acknowledgments

This paper was supported in part by ANR under grant MGA. The work of Guillermo Sapiro is partially supported by ONR, NGA, NSF, ARO, and DARPA.

[Figure 2: Inpainting example on a 12-Megapixel image. Top: damaged and restored images. Bottom: zooming on the damaged and restored images. (Best seen in color.)]

References

Aharon, M., & Elad, M. (2008). Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Journal on Imaging Sciences, 1, 228-247.

Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54, 4311-4322.

Bertsekas, D. (1999). Nonlinear programming. Athena Scientific, Belmont, Mass.

Bickel, P., Ritov, Y., & Tsybakov, A. (2007). Simultaneous analysis of Lasso and Dantzig selector. Preprint.

Bonnans, J., & Shapiro, A. (1998). Optimization problems with perturbation: A guided tour. SIAM Review, 40, 202-227.

Borwein, J., & Lewis, A. (2006). Convex analysis and nonlinear optimization: Theory and examples. Springer.

Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks. Cambridge University Press.

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20, 161-168.

Chen, S., Donoho, D., & Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20, 33-61.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407-499.

Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15, 3736-3745.

Fisk, D. (1965). Quasi-martingales. Transactions of the American Mathematical Society, 120, 369-389.

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1, 302-332.

Fu, W. (1998). Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397-416.

Fuchs, J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51, 3601-3608.

Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19, 801-808.

Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). Supervised dictionary learning. Advances in Neural Information Processing Systems, 21, 1033-1040.

Mairal, J., Elad, M., & Sapiro, G. (2008). Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17, 53-69.

Mallat, S. (1999). A wavelet tour of signal processing, second edition. Academic Press, New York.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311-3325.

Osborne, M., Presnell, B., & Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20, 389-403.

Protter, M., & Elad, M. (2009). Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18, 27-36.

Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. Proceedings of the 24th International Conference on Machine Learning, 759-766.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 301-320.