Online Dictionary Learning for Sparse Coding
Julien Mairal, Francis Bach. INRIA, 45 rue d'Ulm, 75005 Paris, France.
Jean Ponce. Ecole Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France.
Guillermo Sapiro. University of Minnesota, Department of Electrical and Computer Engineering, 200 Union Street SE, Minneapolis, USA.

Abstract

Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains. This paper proposes a new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples. A proof of convergence is presented, along with experiments with natural images demonstrating that it leads to faster performance and better dictionaries than classical batch algorithms for both small and large datasets.

1. Introduction

The linear decomposition of a signal using a few atoms of a learned dictionary, instead of a predefined one based on wavelets (Mallat, 1999) for example, has recently led to state-of-the-art results for numerous low-level image processing tasks such as denoising (Elad & Aharon, 2006), as well as higher-level tasks such as classification (Raina et al., 2007; Mairal et al., 2009), showing that sparse learned models are well adapted to natural signals. Unlike decompositions based on principal component analysis and its variants, these models do not impose that the basis vectors be orthogonal, allowing more flexibility to adapt the representation to the data.

(WILLOW Project, Laboratoire d'Informatique de l'Ecole Normale Supérieure, ENS/INRIA/CNRS UMR. Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)
While learning the dictionary has proven to be critical to achieve (or improve upon) state-of-the-art results, effectively solving the corresponding optimization problem is a significant computational challenge, particularly in the context of the large-scale datasets involved in image processing tasks, which may include millions of training samples. Addressing this challenge is the topic of this paper.

Concretely, consider a signal x in R^m. We say that it admits a sparse approximation over a dictionary D in R^{m×k}, with k columns referred to as atoms, when one can find a linear combination of a few atoms from D that is close to the signal x. Experiments have shown that modelling a signal with such a sparse decomposition (sparse coding) is very effective in many signal processing applications (Chen et al., 1999). For natural images, predefined dictionaries based on various types of wavelets (Mallat, 1999) have been used for this task. However, learning the dictionary instead of using off-the-shelf bases has been shown to dramatically improve signal reconstruction (Elad & Aharon, 2006). Although some of the learned dictionary elements may sometimes look like wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice.

Most recent algorithms for dictionary learning (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) are second-order iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints. Although they have been shown experimentally to be much faster than first-order gradient descent methods (Lee et al., 2007), they cannot effectively handle very large training sets (Bottou & Bousquet, 2008), or dynamic training data changing over time,
such as video sequences. To address these issues, we propose an online approach that processes one element (or a small subset) of the training set at a time. This is particularly important in the context of image and video processing (Protter & Elad, 2009), where it is common to learn dictionaries adapted to small patches, with training data that may include several millions of these patches (roughly one per pixel and per frame). In this setting, online techniques based on stochastic approximations are an attractive alternative to batch methods (Bottou, 1998). For example, first-order stochastic gradient descent with projections on the constraint set is sometimes used for dictionary learning (see Aharon and Elad (2008) for instance). We show in this paper that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to the problem of dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms, and without the need of explicit learning rate tuning. As demonstrated by our experiments, the algorithm scales up gracefully to large datasets with millions of training samples, and it is usually faster than more standard methods.

1.1. Contributions

This paper makes three main contributions:

- We cast in Section 2 the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set, minimizing the (desired) expected cost when the training set size goes to infinity.
- We propose in Section 3 an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints. This method is shown in Section 4 to converge with probability one to a stationary point of the cost function.
- As shown experimentally in Section 5, our algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images.
To demonstrate that it is adapted to difficult, large-scale image-processing tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting.

2. Problem Statement

Classical dictionary learning techniques (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) consider a finite training set of signals X = [x_1, ..., x_n] in R^{m×n} and optimize the empirical cost function

  f_n(D) = (1/n) Σ_{i=1}^n ℓ(x_i, D),    (1)

where D in R^{m×k} is the dictionary, each column representing a basis vector, and ℓ is a loss function such that ℓ(x, D) should be small if D is "good" at representing the signal x. The number of samples n is usually large, whereas the signal dimension m is relatively small, for example, m = 100 for 10×10 image patches, and n ≥ 100,000 for typical image processing applications. In general, we also have k ≪ n (e.g., k = 200 for n = 100,000), and each signal only uses a few elements of D in its representation. Note that, in this setting, overcomplete dictionaries with k > m are allowed. As others (see (Lee et al., 2007) for example), we define ℓ(x, D) as the optimal value of the ℓ1 sparse coding problem:

  ℓ(x, D) = min_{α ∈ R^k} (1/2) ||x − Dα||_2^2 + λ ||α||_1,    (2)

where λ is a regularization parameter.^2 This problem is also known as basis pursuit (Chen et al., 1999), or the Lasso (Tibshirani, 1996). It is well known that the ℓ1 penalty yields a sparse solution for α, but there is no analytic link between the value of λ and the corresponding effective sparsity ||α||_0. To prevent D from being arbitrarily large (which would lead to arbitrarily small values of α), it is common to constrain its columns (d_j)_{j=1}^k to have an ℓ2 norm less than or equal to one. We will call C the convex set of matrices verifying this constraint:

  C = { D ∈ R^{m×k} s.t. for all j = 1, ..., k, d_j^T d_j ≤ 1 }.    (3)

Note that the problem of minimizing the empirical cost f_n(D) is not convex with respect to D. It can be rewritten as a joint optimization problem with respect to the dictionary D and the coefficients α = [α_1, ..., α_n] of the sparse decomposition, which is not jointly convex, but convex with respect to each of the two variables D and α when the other one is fixed:

  min_{D ∈ C, α ∈ R^{k×n}} (1/n) Σ_{i=1}^n ( (1/2) ||x_i − Dα_i||_2^2 + λ ||α_i||_1 ).
(4)

A natural approach to solving this problem is to alternate between the two variables, minimizing over one while keeping the other one fixed, as proposed by Lee et al. (2007) (see also Aharon et al. (2006), who use ℓ0 rather than ℓ1 penalties, for related approaches).^3 Since the computation of α dominates the cost of each iteration, a second-order optimization technique can be used in this case to accurately estimate D at each step when α is fixed. As pointed out by Bottou and Bousquet (2008), however, one is usually not interested in a perfect minimization of

^2 The ℓp norm of a vector x in R^m is defined, for p ≥ 1, by ||x||_p = (Σ_{i=1}^m |x[i]|^p)^{1/p}. Following tradition, we denote by ||x||_0 the number of nonzero elements of the vector x. This ℓ0 sparsity measure is not a true norm.
^3 In our setting, as in (Lee et al., 2007), we use the convex ℓ1 norm, which has empirically proven to be better behaved in general than the ℓ0 pseudo-norm for dictionary learning.
the empirical cost f_n(D), but in the minimization of the expected cost

  f(D) = E_x[ℓ(x, D)] = lim_{n→∞} f_n(D) a.s.,    (5)

where the expectation (which is assumed finite) is taken relative to the (unknown) probability distribution p(x) of the data.^4 In particular, given a finite training set, one should not spend too much effort on accurately minimizing the empirical cost, since it is only an approximation of the expected cost. Bottou and Bousquet (2008) have further shown, both theoretically and experimentally, that stochastic gradient algorithms, whose rate of convergence is not good in conventional optimization terms, may in fact in certain settings be the fastest in reaching a solution with low expected cost. With large training sets, classical batch optimization techniques may indeed become impractical in terms of speed or memory requirements.

In the case of dictionary learning, classical projected first-order stochastic gradient descent (as used by Aharon and Elad (2008) for instance) consists of a sequence of updates of D:

  D_t = Π_C [ D_{t−1} − (ρ/t) ∇_D ℓ(x_t, D_{t−1}) ],    (6)

where ρ is the gradient step, Π_C is the orthogonal projector on C, and the training set x_1, x_2, ... are i.i.d. samples of the (unknown) distribution p(x). As shown in Section 5, we have observed that this method can be competitive compared to batch methods with large training sets, when a good learning rate ρ is selected.

The dictionary learning method we present in the next section falls into the class of online algorithms based on stochastic approximations, processing one sample at a time, but exploits the specific structure of the problem to efficiently solve it. Contrary to classical first-order stochastic gradient descent, it does not require explicit learning rate tuning and minimizes a sequence of quadratic local approximations of the expected cost.
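As a concrete illustration of Eqs. (2) and (6), the following numpy sketch solves the ℓ1 sparse coding problem with ISTA (a simple proximal-gradient substitute for the LARS solver used in the paper) and performs one projected stochastic gradient step on D. Function names and the choice of ISTA are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(x, D, lam, n_iter=500):
    """Approximately solve Eq. (2): min_a 0.5*||x - D a||_2^2 + lam*||a||_1.

    ISTA is used here for simplicity; the paper uses a LARS-based solver.
    """
    a = np.zeros(D.shape[1])
    L = np.linalg.norm(D, ord=2) ** 2  # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return a

def project_columns(D):
    # Orthogonal projection onto C: rescale any column with l2 norm > 1.
    return D / np.maximum(np.linalg.norm(D, axis=0), 1.0)

def sgd_step(D, x, alpha, rho, t):
    """One projected first-order stochastic gradient step, Eq. (6).

    At the sparse coding optimum alpha, the gradient of l(x, D) with
    respect to D is (D @ alpha - x) @ alpha^T.
    """
    grad = np.outer(D @ alpha - x, alpha)
    return project_columns(D - (rho / t) * grad)
```

Note that the projection onto C is separable across columns, which is what makes the simple rescaling above correct.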
3. Online Dictionary Learning

We present in this section the basic components of our online algorithm for dictionary learning (Sections 3.1 to 3.3), as well as two minor variants which speed up our implementation (Section 3.4).

3.1. Algorithm Outline

Our algorithm is summarized in Algorithm 1. Assuming the training set is composed of i.i.d. samples of a distribution p(x), its inner loop draws one element x_t at a time, as in stochastic gradient descent, and alternates classical sparse coding steps for computing the decomposition α_t of x_t over the dictionary D_{t−1} obtained at the previous iteration, with dictionary update steps where the new dictionary D_t is computed by minimizing over C the function

  f̂_t(D) = (1/t) Σ_{i=1}^t ( (1/2) ||x_i − D α_i||_2^2 + λ ||α_i||_1 ),    (7)

where the vectors α_i are computed during the previous steps of the algorithm. The motivation behind our approach is twofold: the quadratic function f̂_t aggregates the past information computed during the previous steps of the algorithm, namely the vectors α_i, and it is easy to show that it upper-bounds the empirical cost f_t(D_t) from Eq. (1). One key aspect of the convergence analysis will be to show that f̂_t(D_t) and f_t(D_t) converge almost surely to the same limit, and thus f̂_t acts as a surrogate for f_t.

^4 We use a.s. (almost sure) to denote convergence with probability one.

Algorithm 1: Online dictionary learning.
Require: x ∈ R^m ~ p(x) (random variable and an algorithm to draw i.i.d. samples of p), λ ∈ R (regularization parameter), D_0 ∈ R^{m×k} (initial dictionary), T (number of iterations).
1: A_0 ← 0, B_0 ← 0 (reset the past information).
2: for t = 1 to T do
3:   Draw x_t from p(x).
4:   Sparse coding: compute, using LARS,
       α_t = argmin_{α ∈ R^k} (1/2) ||x_t − D_{t−1} α||_2^2 + λ ||α||_1.    (8)
5:   A_t ← A_{t−1} + α_t α_t^T.
6:   B_t ← B_{t−1} + x_t α_t^T.
7:   Compute D_t using Algorithm 2, with D_{t−1} as warm restart, so that
       D_t = argmin_{D ∈ C} (1/t) Σ_{i=1}^t ( (1/2) ||x_i − D α_i||_2^2 + λ ||α_i||_1 )
           = argmin_{D ∈ C} (1/t) ( (1/2) Tr(D^T D A_t) − Tr(D^T B_t) ).    (9)
8: end for
9: Return D_T (learned dictionary).
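The loop above can be transcribed almost line for line in numpy. This is an illustrative sketch only, not the paper's C++ implementation; `sparse_code` and `dictionary_update` are hypothetical stand-ins for the LARS solver of step 4 and for Algorithm 2.

```python
import numpy as np

def online_dictionary_learning(samples, D0, sparse_code, dictionary_update):
    """Skeleton of Algorithm 1.

    `sparse_code(x, D)` and `dictionary_update(D, A, B)` are pluggable
    stand-ins for the LARS solver of Eq. (8) and the block-coordinate
    update of Algorithm 2.
    """
    m, k = D0.shape
    A = np.zeros((k, k))  # A_t = sum_i alpha_i alpha_i^T
    B = np.zeros((m, k))  # B_t = sum_i x_i alpha_i^T
    D = D0.copy()
    for x in samples:                   # step 3: draw x_t from p(x)
        alpha = sparse_code(x, D)       # step 4: sparse coding
        A += np.outer(alpha, alpha)     # step 5
        B += np.outer(x, alpha)         # step 6
        D = dictionary_update(D, A, B)  # step 7: warm-restarted update
    return D
```

Only A (k×k) and B (m×k) are kept in memory, which is what gives the algorithm its low memory consumption: the past samples and codes themselves are never stored.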
Moreover, since f̂_t is close to f̂_{t−1}, D_t can be obtained efficiently using D_{t−1} as a warm restart.

3.2. Sparse Coding

The sparse coding problem of Eq. (2) with fixed dictionary is an ℓ1-regularized linear least-squares problem.
A number of recent methods for solving this type of problem are based on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007). When the columns of the dictionary have low correlation, these simple methods have proven to be very efficient. However, the columns of learned dictionaries are in general highly correlated, and we have empirically observed that a Cholesky-based implementation of the LARS-Lasso algorithm, a homotopy method (Osborne et al., 2000; Efron et al., 2004) that provides the whole regularization path, that is, the solutions for all possible values of λ, can be as fast as approaches based on soft thresholding, while providing the solution with a higher accuracy.

Algorithm 2: Dictionary update.
Require: D = [d_1, ..., d_k] ∈ R^{m×k} (input dictionary), A = [a_1, ..., a_k] ∈ R^{k×k} = Σ_{i=1}^t α_i α_i^T, B = [b_1, ..., b_k] ∈ R^{m×k} = Σ_{i=1}^t x_i α_i^T.
1: repeat
2:   for j = 1 to k do
3:     Update the j-th column to optimize for (9):
         u_j ← (1/A_jj)(b_j − D a_j) + d_j,
         d_j ← (1 / max(||u_j||_2, 1)) u_j.    (10)
4:   end for
5: until convergence
6: Return D (updated dictionary).

3.3. Dictionary Update

Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts, and one of its main advantages is that it is parameter-free and does not require any learning rate tuning, which can be difficult in a constrained optimization setting. Concretely, Algorithm 2 sequentially updates each column of D. Using some simple algebra, it is easy to show that Eq. (10) gives the solution of the dictionary update (9) with respect to the j-th column d_j, while keeping the other ones fixed under the constraint d_j^T d_j ≤ 1. Since this convex optimization problem admits separable constraints in the updated blocks (columns), convergence to a global optimum is guaranteed (Bertsekas, 1999). In practice, since the vectors α_i are sparse, the coefficients of the matrix A are in general concentrated on the diagonal, which makes the block-coordinate descent more efficient.^5
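Algorithm 2 admits a direct numpy transcription. The sketch below is illustrative only; it skips nearly unused atoms instead of replacing them by training samples, which is the separate fix discussed in Section 3.4.

```python
import numpy as np

def dictionary_update(D, A, B, n_iter=1, eps=1e-12):
    """Block-coordinate dictionary update of Algorithm 2 / Eq. (10).

    Minimizes 0.5*Tr(D^T D A) - Tr(D^T B) over dictionaries whose columns
    have l2 norm at most one. With a warm restart, a single pass
    (n_iter=1) is reported in the paper to be enough in practice.
    """
    D = D.copy()
    for _ in range(n_iter):
        for j in range(D.shape[1]):
            if A[j, j] < eps:
                continue  # (nearly) unused atom: leave it untouched
            # Unconstrained minimizer in d_j, then projection onto the
            # unit l2 ball, exactly as in Eq. (10).
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```

Each column update minimizes the objective exactly in that block, so the quadratic cost of Eq. (9) never increases across a pass.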
Since our algorithm uses the value of D_{t−1} as a warm restart for computing D_t, a single iteration has empirically been found to be enough. Other approaches have been proposed to update D; for instance, Lee et al. (2007) suggest using a Newton method on the dual of Eq. (9), but this requires inverting a k×k matrix at each Newton iteration, which is impractical for an online algorithm.

3.4. Optimizing the Algorithm

We have presented so far the basic building blocks of our algorithm. This section discusses simple improvements that significantly enhance its performance.

Handling Fixed-Size Datasets. In practice, although it may be very large, the size of the training set is often finite (of course this may not be the case when the data consists of a video stream that must be treated on the fly, for example). In this situation, the same data points may be examined several times, and it is very common in online algorithms to simulate an i.i.d. sampling of p(x) by cycling over a randomly permuted training set (Bottou & Bousquet, 2008). This method works experimentally well in our setting but, when the training set is small enough, it is possible to further speed up convergence: in Algorithm 1, the matrices A_t and B_t carry all the information from the past coefficients α_1, ..., α_t. Suppose that at time t_0, a signal x is drawn and the vector α_{t_0} is computed. If the same signal x is drawn again at time t > t_0, one would like to remove the "old" information concerning x from A_t and B_t, that is, write A_t ← A_{t−1} + α_t α_t^T − α_{t_0} α_{t_0}^T for instance. When dealing with large training sets, it is impossible to store all the past coefficients α_{t_0}, but it is still possible to partially exploit the same idea, by carrying in A_t and B_t the information from the current and previous epochs (cycles through the data) only.

^5 Note that this assumption does not exactly hold: to be more exact, if a group of columns in D are highly correlated, the coefficients of the matrix A can concentrate on the corresponding principal submatrices of A.

Mini-Batch Extension.
In practice, we can improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms. Let us denote by x_{t,1}, ..., x_{t,η} the signals drawn at iteration t. We can then replace lines 5 and 6 of Algorithm 1 by

  A_t ← β A_{t−1} + Σ_{i=1}^η α_{t,i} α_{t,i}^T,
  B_t ← β B_{t−1} + Σ_{i=1}^η x_{t,i} α_{t,i}^T,    (11)

where β is chosen so that β = (θ + 1 − η)/(θ + 1), where θ = tη if t < η and θ = η^2 + t − η if t ≥ η, which is compatible with our convergence analysis.

Purging the Dictionary from Unused Atoms. Every dictionary learning technique sometimes encounters situations where some of the dictionary atoms are never (or very seldom) used, which happens typically with a very bad initialization. A common practice is to replace them during the
optimization by elements of the training set, which solves this problem in practice in most cases.

4. Convergence Analysis

Although our algorithm is relatively simple, its stochastic nature and the non-convexity of the objective function make the proof of its convergence to a stationary point somewhat involved. The main tools used in our proofs are the convergence of empirical processes (Van der Vaart, 1998) and, following Bottou (1998), the convergence of quasi-martingales (Fisk, 1965). Our analysis is limited to the basic version of the algorithm, although it can in principle be carried over to the optimized version discussed in Section 3.4. Because of space limitations, we will restrict ourselves to the presentation of our main results and a sketch of their proofs, which will be presented in detail elsewhere, and first state the (reasonable) assumptions under which our analysis holds.

4.1. Assumptions

(A) The data admits a bounded probability density p with compact support K. Assuming a compact support for the data is natural in audio, image, and video processing applications, where it is imposed by the data acquisition process.

(B) The quadratic surrogate functions f̂_t are strictly convex with lower-bounded Hessians. We assume that the smallest eigenvalue of the positive semi-definite matrix (1/t) A_t defined in Algorithm 1 is greater than or equal to a non-zero constant κ_1 (making A_t invertible and f̂_t strictly convex with lower-bounded Hessian). This hypothesis is in practice verified experimentally after a few iterations of the algorithm when the initial dictionary is reasonable, consisting for example of a few elements from the training set, or any one of the off-the-shelf dictionaries, such as DCT (bases of cosine products) or wavelets. Note that it is easy to enforce this assumption by adding a term (κ_1/2) ||D||_F^2 to the objective function, which is equivalent in practice to replacing the positive semi-definite matrix (1/t) A_t by (1/t) A_t + κ_1 I. We have omitted this penalization in our analysis for simplicity.
(C) A sufficient uniqueness condition of the sparse coding solution is verified. Given some x in K, where K is the support of p, and D in C, let us denote by Λ the set of indices j such that |d_j^T (x − Dα*)| = λ, where α* is the solution of Eq. (2). We assume that there exists κ_2 > 0 such that, for all x in K and all dictionaries D in the subset S of C considered by our algorithm, the smallest eigenvalue of D_Λ^T D_Λ is greater than or equal to κ_2. This matrix is thus invertible and classical results (Fuchs, 2005) ensure the uniqueness of the sparse coding solution. It is of course easy to build a dictionary D for which this assumption fails. However, having D_Λ^T D_Λ invertible is a common assumption in linear regression and in methods such as the LARS algorithm aimed at solving Eq. (2) (Efron et al., 2004). It is also possible to enforce this condition using an elastic net penalization (Zou & Hastie, 2005), replacing ||α||_1 by ||α||_1 + (κ_2/2) ||α||_2^2 and thus improving the numerical stability of homotopy algorithms such as LARS. Again, we have omitted this penalization for simplicity.

4.2. Main Results and Proof Sketches

Given assumptions (A) to (C), let us now show that our algorithm converges to a stationary point of the objective function.

Proposition 1 (convergence of f(D_t) and of the surrogate function). Let f̂_t denote the surrogate function defined in Eq. (7). Under assumptions (A) to (C):
- f̂_t(D_t) converges a.s.;
- f(D_t) − f̂_t(D_t) converges a.s. to 0; and
- f(D_t) converges a.s.

Proof sketch: The first step in the proof is to show that D_{t+1} − D_t = O(1/t), which, although it does not ensure the convergence of D_t, ensures the convergence of the series Σ_{t=1}^∞ ||D_{t+1} − D_t||_F^2, a classical condition in gradient descent convergence proofs (Bertsekas, 1999). In turn, this reduces to showing that D_t minimizes a parametrized quadratic function over C with parameters (1/t)A_t and (1/t)B_t, then showing that the solution is uniformly Lipschitz with respect to these parameters, borrowing some ideas from perturbation theory (Bonnans & Shapiro, 1998).
At this point, and following Bottou (1998), proving the convergence of the sequence f̂_t(D_t) amounts to showing that the stochastic positive process

  u_t = f̂_t(D_t) ≥ 0    (12)

is a quasi-martingale. To do so, denoting by F_t the filtration of the past information, a theorem by Fisk (1965) states that if the positive sum Σ_{t=1}^∞ E[max(E[u_{t+1} − u_t | F_t], 0)] converges, then u_t is a quasi-martingale which converges with probability one. Using some results on empirical processes (Van der Vaart, 1998, Chap. 19.2, Donsker Theorem), we obtain a bound that ensures the convergence of this series. It follows from the convergence of u_t that f_t(D_t) − f̂_t(D_t) converges to zero with probability one. Then, a classical theorem from perturbation theory (Bonnans & Shapiro, 1998, Theorem 4.1) shows that ℓ(x, D) is C^1. This allows us to use a last result on empirical processes ensuring that f(D_t) − f̂_t(D_t) converges almost surely to 0. Therefore f(D_t) converges as well with probability one.
Proposition 2 (convergence to a stationary point). Under assumptions (A) to (C), D_t is asymptotically close to the set of stationary points of the dictionary learning problem with probability one.

Proof sketch: The first step in the proof is to show, using classical analysis tools, that, given assumptions (A) to (C), f is C^1 with a Lipschitz gradient. Considering Ã and B̃, two accumulation points of (1/t)A_t and (1/t)B_t respectively, we can define the corresponding surrogate function f̂_∞ such that for all D in C, f̂_∞(D) = (1/2) Tr(D^T D Ã) − Tr(D^T B̃), and its optimum D_∞ on C. The next step consists of showing that ∇f̂_∞(D_∞) = ∇f(D_∞) and that −∇f(D_∞) is in the normal cone of the set C, that is, D_∞ is a stationary point of the dictionary learning problem (Borwein & Lewis, 2006).

5. Experimental Validation

In this section, we present experiments on natural images to demonstrate the efficiency of our method.

5.1. Performance Evaluation

For our experiments, we have randomly selected patches from images in the Berkeley segmentation dataset, which is a standard image database; 10^6 of these are kept for training, and the rest for testing. We used these patches to create three datasets A, B, and C with increasing patch and dictionary sizes, representing various typical settings in image processing applications:

Data | Signal size m  | Number k of atoms | Type
A    | 8×8 = 64       | 256               | b&w
B    | 12×12×3 = 432  | 512               | color
C    | 16×16 = 256    | 1024              | b&w

We have normalized the patches to have unit ℓ2 norm and used the regularization parameter λ = 1.2/√m in all of our experiments. The 1/√m term is a classical normalization factor (Bickel et al., 2007), and the constant 1.2 has been experimentally shown to yield reasonable sparsities (about 10 nonzero coefficients) in these experiments. We have implemented the proposed algorithm in C++ with a Matlab interface. All the results presented in this section use the mini-batch refinement from Section 3.4, since it has been shown empirically to improve speed by a factor of 10 or more. This requires tuning the parameter η, the number of signals drawn at each iteration.
Trying different powers of 2 for this variable has shown that η = 256 was a good choice (lowest objective function values on the training set; empirically, this setting also yields the lowest values on the test set), but values of 128 and 512 have given very similar performances.

Our implementation can be used in both the online setting it is intended for, and in a regular batch mode where it uses the entire dataset at each iteration (corresponding to the mini-batch version with η = n). We have also implemented a first-order stochastic gradient descent algorithm that shares most of its code with our algorithm, except for the dictionary update step. This setting allows us to draw meaningful comparisons between our algorithm and its batch and stochastic gradient alternatives, which would have been difficult otherwise. For example, comparing our algorithm to the Matlab implementation of the batch approach from (Lee et al., 2007) developed by its authors would have been unfair since our C++ program has a built-in speed advantage. Although our implementation is multithreaded, our experiments have been run for simplicity on a single-CPU, single-core 2.4GHz machine. To measure and compare the performances of the three tested methods, we have plotted the value of the objective function on the test set, acting as a surrogate of the expected cost, as a function of the corresponding training time.

Online vs. Batch. Figure 1 (top) compares the online and batch settings of our implementation. The full training set consists of 10^6 samples. The online version of our algorithm draws samples from the entire set, and we have run its batch version on the full dataset as well as subsets of size 10^4 and 10^5 (see Figure 1). The online setting systematically outperforms its batch counterpart for every training set size and desired precision. We use a logarithmic scale for the computation time, which shows that in many situations, the difference in performance can be dramatic. Similar experiments have given similar results on smaller datasets.

Comparison with Stochastic Gradient Descent.
Our experiments have shown that obtaining good performance with stochastic gradient descent requires using both the mini-batch heuristic and carefully choosing the learning rate ρ. To give the fairest comparison possible, we have thus optimized these parameters, sampling η values among powers of 2 (as before) and ρ values among powers of 10. The combination of values ρ = 10^4, η = 512 gives the best results on the training and test data for stochastic gradient descent. Figure 1 (bottom) compares our method with stochastic gradient descent for different ρ values around 10^4 and a fixed value of η = 512. We observe that the larger the value of ρ is, the better the eventual value of the objective function is after many iterations, but the longer it will take to achieve a good precision. Although our method performs better at such high-precision settings for dataset C, it appears that, in general, for a desired precision and a particular dataset, it is possible to tune the stochastic gradient descent algorithm to achieve a performance similar to that of our algorithm. Note that both stochastic gradient descent and our method only start decreasing the objective function value after a few iterations. Slightly better results could be obtained by using smaller gradient steps
[Figure 1. Top: Comparison between online and batch learning for various training set sizes (batch with n = 10^4, 10^5, 10^6, on evaluation sets A, B, and C). Bottom: Comparison between our method and stochastic gradient (SG) descent with different learning rates ρ (5·10^3, 10^4, 5·10^4). In both cases, the value of the objective function evaluated on the test set is reported as a function of computation time on a logarithmic scale. Values of the objective function greater than its initial value are truncated.]

during the first iterations, using a learning rate of the form ρ/(t + t_0) for the stochastic gradient descent, and initializing A_0 = t_0 I and B_0 = t_0 D_0 for the matrices A_t and B_t, where t_0 is a new parameter.

5.2. Application to Inpainting

Our last experiment demonstrates that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 2. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7×10^6 undamaged 12×12 color patches in the image, with two epochs, in about 500 seconds on a 2.4GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008). Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard datasets. Instead, we just wish to demonstrate that the proposed method can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data.
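The inpainting step relies on sparse coding with missing data. The paper uses the method of Mairal et al. (2008); the sketch below is a simplified stand-in that codes only the observed pixels of a patch (the rows of D selected by a mask) and fills the missing pixels from the reconstruction, using ISTA for the ℓ1 problem. All names and the ISTA solver are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def inpaint_patch(x, mask, D, lam, n_iter=500):
    """Fill missing pixels of a patch with a learned dictionary.

    Simplified stand-in for the inpainting technique of Mairal et al.
    (2008): sparse-code x restricted to the observed pixels
    (mask == True), then reconstruct the missing pixels as
    (D @ alpha)[~mask].
    """
    Dm = D[mask]                      # rows of D at observed pixels
    a = np.zeros(D.shape[1])
    L = max(np.linalg.norm(Dm, ord=2) ** 2, 1e-12)
    for _ in range(n_iter):           # ISTA on the observed entries only
        z = a - Dm.T @ (Dm @ a - x[mask]) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    x_hat = x.copy()
    x_hat[~mask] = (D @ a)[~mask]     # keep observed pixels untouched
    return x_hat
```

Because the code α is estimated from the observed pixels alone, the sparsity prior is what transfers information to the missing ones; with a well-adapted dictionary, a few atoms suffice to extrapolate the patch.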
For comparison, the dictionaries used for inpainting in the state-of-the-art method of Mairal et al. (2008) are learned (in batch mode) on only 200,000 patches.

6. Discussion

We have introduced in this paper a new stochastic online algorithm for learning dictionaries adapted to sparse coding tasks, and proven its convergence. Preliminary experiments demonstrate that it is significantly faster than batch alternatives on large datasets that may contain millions of training examples, yet it does not require learning rate tuning like regular stochastic gradient descent methods. More experiments are of course needed to better assess the promise of this approach in image restoration tasks such as denoising, deblurring, and inpainting. Beyond this, we plan to use the proposed learning framework for sparse coding in computationally demanding video restoration tasks (Protter & Elad, 2009), with dynamic datasets whose size is not fixed, and also plan to extend this framework to different loss functions to address discriminative tasks such as image classification (Mairal et al., 2009), which are more sensitive to overfitting than reconstructive ones, and various matrix factorization tasks, such as non-negative matrix factorization with sparseness constraints and sparse principal component analysis.

Acknowledgments

This paper was supported in part by ANR under grant MGA. The work of Guillermo Sapiro is partially supported by ONR, NGA, NSF, ARO, and DARPA.
[Figure 2. Inpainting example on a 12-Megapixel image. Top: Damaged and restored images. Bottom: Zooming on the damaged and restored images. (Best seen in color.)]

References

Aharon, M., & Elad, M. (2008). Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Journal on Imaging Sciences, 1.
Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54.
Bertsekas, D. (1999). Nonlinear programming. Athena Scientific, Belmont, Mass.
Bickel, P., Ritov, Y., & Tsybakov, A. (2007). Simultaneous analysis of Lasso and Dantzig selector. Preprint.
Bonnans, J., & Shapiro, A. (1998). Optimization problems with perturbation: A guided tour. SIAM Review, 40.
Borwein, J., & Lewis, A. (2006). Convex analysis and nonlinear optimization: Theory and examples. Springer.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning and neural networks. Cambridge University Press.
Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20.
Chen, S., Donoho, D., & Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15.
Fisk, D. (1965). Quasi-martingales. Transactions of the American Mathematical Society, 120.
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1.
Fu, W. (1998). Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7.
Fuchs, J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51.
Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). Supervised dictionary learning. Advances in Neural Information Processing Systems, 21.
Mairal, J., Elad, M., & Sapiro, G. (2008). Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17.
Mallat, S. (1999). A wavelet tour of signal processing, second edition. Academic Press, New York.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37.
Osborne, M., Presnell, B., & Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20.
Protter, M., & Elad, M. (2009). Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18.
Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. Proceedings of the 24th International Conference on Machine Learning.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58.
Van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67.
