Online Dictionary Learning for Sparse Coding


Julien Mairal, Francis Bach
INRIA, 45 rue d'Ulm, 75005 Paris, France

Jean Ponce
Ecole Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France

Guillermo Sapiro
University of Minnesota, Department of Electrical and Computer Engineering, 200 Union Street SE, Minneapolis, USA

Abstract

Sparse coding, that is, modelling data vectors as sparse linear combinations of basis elements, is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data, an approach that has recently proven to be very effective for signal reconstruction and classification in the audio and image processing domains. This paper proposes a new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples. A proof of convergence is presented, along with experiments with natural images demonstrating that it leads to faster performance and better dictionaries than classical batch algorithms for both small and large datasets.

1. Introduction

The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one, based on wavelets (Mallat, 1999) for example, has recently led to state-of-the-art results for numerous low-level image processing tasks such as denoising (Elad & Aharon, 2006) as well as higher-level tasks such as classification (Raina et al., 2007; Mairal et al., 2009), showing that sparse learned models are well adapted to natural signals. Unlike decompositions based on principal component analysis and its variants, these models do not impose that the basis vectors be orthogonal, allowing more flexibility to adapt the representation to the data.

(1) WILLOW Project, Laboratoire d'Informatique de l'Ecole Normale Supérieure, ENS/INRIA/CNRS UMR 8548. Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).
While learning the dictionary has proven to be critical to achieve (or improve upon) state-of-the-art results, effectively solving the corresponding optimization problem is a significant computational challenge, particularly in the context of the large-scale datasets involved in image processing tasks, that may include millions of training samples. Addressing this challenge is the topic of this paper.

Concretely, consider a signal x in R^m. We say that it admits a sparse approximation over a dictionary D in R^{m x k}, with k columns referred to as atoms, when one can find a linear combination of a "few" atoms from D that is "close" to the signal x. Experiments have shown that modelling a signal with such a sparse decomposition (sparse coding) is very effective in many signal processing applications (Chen et al., 1999). For natural images, predefined dictionaries based on various types of wavelets (Mallat, 1999) have been used for this task. However, learning the dictionary instead of using off-the-shelf bases has been shown to dramatically improve signal reconstruction (Elad & Aharon, 2006). Although some of the learned dictionary elements may sometimes look like wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice.

Most recent algorithms for dictionary learning (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) are second-order iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints. Although they have shown experimentally to be much faster than first-order gradient descent methods (Lee et al., 2007), they cannot effectively handle very large training sets (Bottou & Bousquet, 2008), or dynamic training data changing over time,
such as video sequences. To address these issues, we propose an online approach that processes one element (or a small subset) of the training set at a time. This is particularly important in the context of image and video processing (Protter & Elad, 2009), where it is common to learn dictionaries adapted to small patches, with training data that may include several millions of these patches (roughly one per pixel and per frame). In this setting, online techniques based on stochastic approximations are an attractive alternative to batch methods (Bottou, 1998). For example, first-order stochastic gradient descent with projections on the constraint set is sometimes used for dictionary learning (see Aharon and Elad (2008) for instance). We show in this paper that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to the problem of dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms and without the need of explicit learning rate tuning. As demonstrated by our experiments, the algorithm scales up gracefully to large datasets with millions of training samples, and it is usually faster than more standard methods.

1.1. Contributions

This paper makes three main contributions:
- We cast in Section 2 the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set, minimizing the (desired) expected cost when the training set size goes to infinity.
- We propose in Section 3 an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints. This method is shown in Section 4 to converge with probability one to a stationary point of the cost function.
- As shown experimentally in Section 5, our algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images.
To demonstrate that it is adapted to difficult, large-scale image-processing tasks, we learn a dictionary on a 12-Megapixel photograph and use it for inpainting.

2. Problem Statement

Classical dictionary learning techniques (Olshausen & Field, 1997; Aharon et al., 2006; Lee et al., 2007) consider a finite training set of signals X = [x_1, ..., x_n] in R^{m x n} and optimize the empirical cost function

    f_n(D) = (1/n) sum_{i=1}^n l(x_i, D),    (1)

where D in R^{m x k} is the dictionary, each column representing a basis vector, and l is a loss function such that l(x, D) should be small if D is "good" at representing the signal x. The number of samples n is usually large, whereas the signal dimension m is relatively small, for example, m = 100 for 10 x 10 image patches, and n >= 100,000 for typical image processing applications. In general, we also have k << n (e.g., k = 200 for n = 100,000), and each signal only uses a few elements of D in its representation. Note that, in this setting, overcomplete dictionaries with k > m are allowed. As others (see (Lee et al., 2007) for example), we define l(x, D) as the optimal value of the l1-sparse coding problem:

    l(x, D) = min_{α in R^k} (1/2) ||x - Dα||_2^2 + λ ||α||_1,    (2)

where λ is a regularization parameter.(2) This problem is also known as basis pursuit (Chen et al., 1999), or the Lasso (Tibshirani, 1996). It is well known that the l1 penalty yields a sparse solution for α, but there is no analytic link between the value of λ and the corresponding effective sparsity ||α||_0. To prevent D from being arbitrarily large (which would lead to arbitrarily small values of α), it is common to constrain its columns (d_j)_{j=1}^k to have an l2 norm less than or equal to one. We will call C the convex set of matrices verifying this constraint:

    C = {D in R^{m x k} s.t. for all j = 1, ..., k, d_j^T d_j <= 1}.    (3)

Note that the problem of minimizing the empirical cost f_n(D) is not convex with respect to D. It can be rewritten as a joint optimization problem with respect to the dictionary D and the coefficients α = [α_1, ..., α_n] of the sparse decomposition, which is not jointly convex, but convex with respect to each of the two variables D and α when the other one is fixed:

    min_{D in C, α in R^{k x n}} (1/n) sum_{i=1}^n ( (1/2) ||x_i - Dα_i||_2^2 + λ ||α_i||_1 ).    (4)

A natural approach to solving this problem is to alternate between the two variables, minimizing over one while keeping the other one fixed, as proposed by Lee et al. (2007) (see also Aharon et al. (2006), who use l0 rather than l1 penalties, for related approaches).(3) Since the computation of α dominates the cost of each iteration, a second-order optimization technique can be used in this case to accurately estimate D at each step when α is fixed.

(2) The lp norm of a vector x in R^m is defined, for p >= 1, by ||x||_p = (sum_{i=1}^m |x[i]|^p)^{1/p}. Following tradition, we denote by ||x||_0 the number of nonzero elements of the vector x. This "l0 sparsity measure" is not a true norm.
(3) In our setting, as in (Lee et al., 2007), we use the convex l1 norm, that has empirically proven to be better behaved in general than the l0 pseudo-norm for dictionary learning.

As pointed out by Bottou and Bousquet (2008), however, one is usually not interested in a perfect minimization of
the empirical cost f_n(D), but in the minimization of the expected cost

    f(D) = E_x[l(x, D)] = lim_{n→∞} f_n(D) a.s.,    (5)

where the expectation (which is assumed finite) is taken relative to the (unknown) probability distribution p(x) of the data.(4) In particular, given a finite training set, one should not spend too much effort on accurately minimizing the empirical cost, since it is only an approximation of the expected cost. Bottou and Bousquet (2008) have further shown both theoretically and experimentally that stochastic gradient algorithms, whose rate of convergence is not good in conventional optimization terms, may in fact in certain settings be the fastest in reaching a solution with low expected cost. With large training sets, classical batch optimization techniques may indeed become impractical in terms of speed or memory requirements.

In the case of dictionary learning, classical projected first-order stochastic gradient descent (as used by Aharon and Elad (2008) for instance) consists of a sequence of updates of D:

    D_t = Π_C [ D_{t-1} - (ρ/t) ∇_D l(x_t, D_{t-1}) ],    (6)

where ρ is the gradient step, Π_C is the orthogonal projector on C, and the training set x_1, x_2, ... are i.i.d. samples of the (unknown) distribution p(x). As shown in Section 5, we have observed that this method can be competitive compared to batch methods with large training sets, when a good learning rate ρ is selected.

The dictionary learning method we present in the next section falls into the class of online algorithms based on stochastic approximations, processing one sample at a time, but exploits the specific structure of the problem to efficiently solve it. Contrary to classical first-order stochastic gradient descent, it does not require explicit learning rate tuning and minimizes a sequence of quadratic local approximations of the expected cost.

3.
Online Dictionary Learning

We present in this section the basic components of our online algorithm for dictionary learning (Sections 3.1 to 3.3), as well as two minor variants which speed up our implementation (Section 3.4).

3.1. Algorithm Outline

Our algorithm is summarized in Algorithm 1. Assuming the training set composed of i.i.d. samples of a distribution p(x), its inner loop draws one element x_t at a time, as in stochastic gradient descent, and alternates classical sparse coding steps for computing the decomposition α_t of x_t over the dictionary D_{t-1} obtained at the previous iteration, with dictionary update steps where the new dictionary D_t is computed by minimizing over C the function

    f^_t(D) = (1/t) sum_{i=1}^t ( (1/2) ||x_i - Dα_i||_2^2 + λ ||α_i||_1 ),    (7)

where the vectors α_i are computed during the previous steps of the algorithm.

(4) We use "a.s." (almost sure) to denote convergence with probability one.

Algorithm 1 Online dictionary learning.
Require: x in R^m ~ p(x) (random variable and an algorithm to draw i.i.d. samples of p), λ in R (regularization parameter), D_0 in R^{m x k} (initial dictionary), T (number of iterations).
1: A_0 ← 0, B_0 ← 0 (reset the "past" information).
2: for t = 1 to T do
3:   Draw x_t from p(x).
4:   Sparse coding: compute using LARS
         α_t = argmin_{α in R^k} (1/2) ||x_t - D_{t-1} α||_2^2 + λ ||α||_1.    (8)
5:   A_t ← A_{t-1} + α_t α_t^T.
6:   B_t ← B_{t-1} + x_t α_t^T.
7:   Compute D_t using Algorithm 2, with D_{t-1} as warm restart, so that
         D_t = argmin_{D in C} (1/t) sum_{i=1}^t ( (1/2) ||x_i - Dα_i||_2^2 + λ ||α_i||_1 )
             = argmin_{D in C} (1/t) ( (1/2) Tr(D^T D A_t) - Tr(D^T B_t) ).    (9)
8: end for
9: Return D_T (learned dictionary).

The motivation behind our approach is twofold:
- The quadratic function f^_t aggregates the past information computed during the previous steps of the algorithm, namely the vectors α_i, and it is easy to show that it upper-bounds the empirical cost f_t(D_t) from Eq. (1). One key aspect of the convergence analysis will be to show that f^_t(D_t) and f_t(D_t) converge almost surely to the same limit and thus f^_t acts as a surrogate for f_t.
- Since f^_t is close to f^_{t-1}, D_t can be obtained efficiently using D_{t-1} as warm restart.

3.2. Sparse Coding

The sparse coding problem of Eq. (2) with fixed dictionary is an l1-regularized linear least-squares problem. A
Algorithm 2 Dictionary Update.
Require: D = [d_1, ..., d_k] in R^{m x k} (input dictionary), A = [a_1, ..., a_k] in R^{k x k} = sum_{i=1}^t α_i α_i^T, B = [b_1, ..., b_k] in R^{m x k} = sum_{i=1}^t x_i α_i^T.
1: repeat
2:   for j = 1 to k do
3:     Update the j-th column to optimize for (9):
           u_j ← (1/A_jj) (b_j - D a_j) + d_j,
           d_j ← (1/max(||u_j||_2, 1)) u_j.    (10)
4:   end for
5: until convergence
6: Return D (updated dictionary).

number of recent methods for solving this type of problems are based on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007). When the columns of the dictionary have low correlation, these simple methods have proven to be very efficient. However, the columns of learned dictionaries are in general highly correlated, and we have empirically observed that a Cholesky-based implementation of the LARS-Lasso algorithm, a homotopy method (Osborne et al., 2000; Efron et al., 2004) that provides the whole regularization path, that is, the solutions for all possible values of λ, can be as fast as approaches based on soft thresholding, while providing the solution with a higher accuracy.

3.3. Dictionary Update

Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts, and one of its main advantages is that it is parameter-free and does not require any learning rate tuning, which can be difficult in a constrained optimization setting. Concretely, Algorithm 2 sequentially updates each column of D. Using some simple algebra, it is easy to show that Eq. (10) gives the solution of the dictionary update (9) with respect to the j-th column d_j, while keeping the other ones fixed under the constraint d_j^T d_j <= 1. Since this convex optimization problem admits separable constraints in the updated blocks (columns), convergence to a global optimum is guaranteed (Bertsekas, 1999). In practice, since the vectors α_i are sparse, the coefficients of the matrix A are in general concentrated on the diagonal, which makes the block-coordinate descent more efficient.(5)
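To make the two building blocks above concrete, here is a minimal NumPy sketch of Algorithms 1 and 2. It is not the authors' implementation: the sparse coding step uses plain ISTA (proximal gradient with soft thresholding) in place of the Cholesky-based LARS-Lasso discussed above, and the function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(x, D, lam, n_iter=300):
    # Lasso step of Eq. (8), solved here by ISTA; the paper uses LARS-Lasso.
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the quadratic part
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_threshold(alpha - D.T @ (D @ alpha - x) / L, lam / L)
    return alpha

def update_dictionary(D, A, B, n_iter=1):
    # Algorithm 2: block-coordinate descent on the columns of D, Eq. (10).
    # With the warm restart D_{t-1}, one pass (n_iter=1) suffices in practice.
    D = D.copy()
    for _ in range(n_iter):
        for j in range(D.shape[1]):
            if A[j, j] < 1e-12:
                continue  # atom unused so far; see Section 3.4 on purging
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project onto l2 ball
    return D

def online_dictionary_learning(draw_x, D0, lam, T):
    # Algorithm 1: accumulate the sufficient statistics A_t, B_t and
    # minimize the quadratic surrogate of Eq. (9) at each step.
    D = D0.copy()
    k = D.shape[1]
    A = np.zeros((k, k))
    B = np.zeros_like(D)
    for _ in range(T):
        x = draw_x()
        alpha = sparse_code(x, D, lam)
        A += np.outer(alpha, alpha)  # A_t <- A_{t-1} + alpha_t alpha_t^T
        B += np.outer(x, alpha)      # B_t <- B_{t-1} + x_t alpha_t^T
        D = update_dictionary(D, A, B)
    return D
```

Note that the only state carried between iterations is (A_t, B_t), of fixed sizes k x k and m x k regardless of t: since (1/2) Tr(D^T D A_t) - Tr(D^T B_t) equals sum_i (1/2) ||x_i - Dα_i||_2^2 up to a constant independent of D, the past signals and codes never need to be stored, which is the source of the low memory consumption claimed in the introduction.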
Since our algorithm uses the value of D_{t-1} as a warm restart for computing D_t, a single iteration has empirically been found to be enough. Other approaches have been proposed to update D; for instance, Lee et al. (2007) suggest using a Newton method on the dual of Eq. (9), but this requires inverting a k x k matrix at each Newton iteration, which is impractical for an online algorithm.

(5) Note that this assumption does not exactly hold: To be more exact, if a group of columns in D are highly correlated, the coefficients of the matrix A can concentrate on the corresponding principal submatrices of A.

3.4. Optimizing the Algorithm

We have presented so far the basic building blocks of our algorithm. This section discusses simple improvements that significantly enhance its performance.

Handling Fixed-Size Datasets. In practice, although it may be very large, the size of the training set is often finite (of course this may not be the case, when the data consists of a video stream that must be treated on the fly for example). In this situation, the same data points may be examined several times, and it is very common in online algorithms to simulate an i.i.d. sampling of p(x) by cycling over a randomly permuted training set (Bottou & Bousquet, 2008). This method works experimentally well in our setting but, when the training set is small enough, it is possible to further speed up convergence: In Algorithm 1, the matrices A_t and B_t carry all the information from the past coefficients α_1, ..., α_t. Suppose that at time t_0, a signal x is drawn and the vector α_{t_0} is computed. If the same signal x is drawn again at time t > t_0, one would like to remove the "old" information concerning x from A_t and B_t, that is, write A_t ← A_{t-1} + α_t α_t^T - α_{t_0} α_{t_0}^T for instance. When dealing with large training sets, it is impossible to store all the past coefficients α_{t_0}, but it is still possible to partially exploit the same idea, by carrying in A_t and B_t the information from the current and previous epochs (cycles through the data) only.

Mini-Batch Extension.
In practice, we can improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms. Let us denote by x_{t,1}, ..., x_{t,η} the signals drawn at iteration t. We can then replace lines 5 and 6 of Algorithm 1 by

    A_t ← β A_{t-1} + sum_{i=1}^η α_{t,i} α_{t,i}^T,
    B_t ← β B_{t-1} + sum_{i=1}^η x_{t,i} α_{t,i}^T,    (11)

where β is chosen so that β = (θ + 1 - η)/(θ + 1), where θ = tη if t < η and η^2 + t - η if t >= η, which is compatible with our convergence analysis.

Purging the Dictionary from Unused Atoms. Every dictionary learning technique sometimes encounters situations where some of the dictionary atoms are never (or very seldom) used, which happens typically with a very bad initialization. A common practice is to replace them during the
optimization by elements of the training set, which solves in practice this problem in most cases.

4. Convergence Analysis

Although our algorithm is relatively simple, its stochastic nature and the non-convexity of the objective function make the proof of its convergence to a stationary point somewhat involved. The main tools used in our proofs are the convergence of empirical processes (Van der Vaart, 1998) and, following Bottou (1998), the convergence of quasi-martingales (Fisk, 1965). Our analysis is limited to the basic version of the algorithm, although it can in principle be carried over to the optimized version discussed in Section 3.4. Because of space limitations, we will restrict ourselves to the presentation of our main results and a sketch of their proofs, which will be presented in detail elsewhere, and first the (reasonable) assumptions under which our analysis holds.

4.1. Assumptions

(A) The data admits a bounded probability density p with compact support K. Assuming a compact support for the data is natural in audio, image, and video processing applications, where it is imposed by the data acquisition process.

(B) The quadratic surrogate functions f^_t are strictly convex with lower-bounded Hessians. We assume that the smallest eigenvalue of the positive semi-definite matrix (1/t) A_t defined in Algorithm 1 is greater than or equal to a non-zero constant κ_1 (making A_t invertible and f^_t strictly convex with Hessian lower-bounded). This hypothesis is in practice verified experimentally after a few iterations of the algorithm when the initial dictionary is reasonable, consisting for example of a few elements from the training set, or any one of the "off-the-shelf" dictionaries, such as DCT (bases of cosine products) or wavelets. Note that it is easy to enforce this assumption by adding a term (κ_1/2) ||D||_F^2 to the objective function, which is equivalent in practice to replacing the positive semi-definite matrix (1/t) A_t by (1/t) A_t + κ_1 I. We have omitted for simplicity this penalization in our analysis.
(C) A sufficient uniqueness condition of the sparse coding solution is verified: given some x in K, where K is the support of p, and D in C, let us denote by Λ the set of indices j such that |d_j^T (x - Dα*)| = λ, where α* is the solution of Eq. (2). We assume that there exists κ_2 > 0 such that, for all x in K and all dictionaries D in the subset S of C considered by our algorithm, the smallest eigenvalue of D_Λ^T D_Λ is greater than or equal to κ_2. This matrix is thus invertible and classical results (Fuchs, 2005) ensure the uniqueness of the sparse coding solution. It is of course easy to build a dictionary D for which this assumption fails. However, having D_Λ^T D_Λ invertible is a common assumption in linear regression and in methods such as the LARS algorithm aimed at solving Eq. (2) (Efron et al., 2004). It is also possible to enforce this condition using an elastic net penalization (Zou & Hastie, 2005), replacing ||α||_1 by ||α||_1 + (κ_2/2) ||α||_2^2 and thus improving the numerical stability of homotopy algorithms such as LARS. Again, we have omitted this penalization for simplicity.

4.2. Main Results and Proof Sketches

Given assumptions (A) to (C), let us now show that our algorithm converges to a stationary point of the objective function.

Proposition 1 (convergence of f(D_t) and of the surrogate function). Let f^_t denote the surrogate function defined in Eq. (7). Under assumptions (A) to (C):
- f^_t(D_t) converges a.s.;
- f(D_t) - f^_t(D_t) converges a.s. to 0; and
- f(D_t) converges a.s.

Proof sketch: The first step in the proof is to show that D_{t+1} - D_t = O(1/t) which, although it does not ensure the convergence of D_t, ensures the convergence of the series sum_{t=1}^∞ ||D_{t+1} - D_t||_F^2, a classical condition in gradient descent convergence proofs (Bertsekas, 1999). In turn, this reduces to showing that D_t minimizes a parametrized quadratic function over C with parameters (1/t) A_t and (1/t) B_t, then showing that the solution is uniformly Lipschitz with respect to these parameters, borrowing some ideas from perturbation theory (Bonnans & Shapiro, 1998).
At this point, and following Bottou (1998), proving the convergence of the sequence f^_t(D_t) amounts to showing that the stochastic positive process

    u_t = f^_t(D_t) >= 0,    (12)

is a quasi-martingale. To do so, denoting by F_t the filtration of the past information, a theorem by Fisk (1965) states that if the positive sum sum_{t=1}^∞ E[max(E[u_{t+1} - u_t | F_t], 0)] converges, then u_t is a quasi-martingale which converges with probability one. Using some results on empirical processes (Van der Vaart, 1998, Chap. 19.2, Donsker Theorem), we obtain a bound that ensures the convergence of this series. It follows from the convergence of u_t that f_t(D_t) - f^_t(D_t) converges to zero with probability one. Then, a classical theorem from perturbation theory (Bonnans & Shapiro, 1998, Theorem 4.1) shows that l(x, D) is C^1. This allows us to use a last result on empirical processes ensuring that f(D_t) - f^_t(D_t) converges almost surely to 0. Therefore f(D_t) converges as well with probability one.
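One ingredient of this argument, the fact that f^_t upper-bounds the empirical cost f_t (each past code α_i is feasible, but in general suboptimal, for the Lasso problem at the current dictionary), is easy to check numerically. The following is a hedged sketch with illustrative names, again using plain ISTA as a stand-in for the LARS-Lasso solver:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(x, D, lam, alpha0=None, n_iter=500):
    # Approximates l(x, D) of Eq. (2) by ISTA, optionally warm-started.
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    alpha = np.zeros(D.shape[1]) if alpha0 is None else alpha0.copy()
    for _ in range(n_iter):
        alpha = soft_threshold(alpha - D.T @ (D @ alpha - x) / L, lam / L)
    return alpha

def lasso_objective(x, D, alpha, lam):
    return 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.sum(np.abs(alpha))

def surrogate_and_empirical(X, D, old_alphas, lam):
    # f^_t(D): plug in the codes computed against past dictionaries, Eq. (7).
    # f_t(D):  re-optimize each code for the current D (warm-started), Eq. (1).
    surrogate = np.mean([lasso_objective(x, D, a, lam)
                         for x, a in zip(X.T, old_alphas)])
    empirical = np.mean([lasso_objective(x, D, lasso(x, D, lam, alpha0=a), lam)
                         for x, a in zip(X.T, old_alphas)])
    return surrogate, empirical
```

Because each old code remains feasible for the new Lasso problem, re-optimizing can only lower the objective, so f^_t(D) >= f_t(D) for every D; Proposition 1 shows that the two quantities actually meet in the limit along the sequence D_t.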

Proposition 2 (convergence to a stationary point). Under assumptions (A) to (C), D_t is asymptotically close to the set of stationary points of the dictionary learning problem with probability one.

Proof sketch: The first step in the proof is to show using classical analysis tools that, given assumptions (A) to (C), f is C^1 with a Lipschitz gradient. Considering Ã and B̃, two accumulation points of (1/t) A_t and (1/t) B_t respectively, we can define the corresponding surrogate function f^_∞ such that for all D in C, f^_∞(D) = (1/2) Tr(D^T D Ã) - Tr(D^T B̃), and its optimum D_∞ on C. The next step consists of showing that ∇f^_∞(D_∞) = ∇f(D_∞) and that -∇f(D_∞) is in the normal cone of the set C, that is, D_∞ is a stationary point of the dictionary learning problem (Borwein & Lewis, 2006).

5. Experimental Validation

In this section, we present experiments on natural images to demonstrate the efficiency of our method.

5.1. Performance Evaluation

For our experiments, we have randomly selected 1.25 x 10^6 patches from images in the Berkeley segmentation dataset, which is a standard image database; 10^6 of these are kept for training, and the rest for testing. We used these patches to create three datasets A, B, and C with increasing patch and dictionary sizes representing various typical settings in image processing applications:

    Data   Signal size m        Nb k of atoms   Type
    A      8 x 8 = 64           256             b&w
    B      12 x 12 x 3 = 432    512             color
    C      16 x 16 = 256        400             b&w

We have normalized the patches to have unit l2-norm and used the regularization parameter λ = 1.2/sqrt(m) in all of our experiments. The 1/sqrt(m) term is a classical normalization factor (Bickel et al., 2007), and the constant 1.2 has been experimentally shown to yield reasonable sparsities (about 10 nonzero coefficients) in these experiments. We have implemented the proposed algorithm in C++ with a Matlab interface. All the results presented in this section use the mini-batch refinement from Section 3.4 since this has shown empirically to improve speed by a factor of 10 or more. This requires to tune the parameter η, the number of signals drawn at each iteration.
Trying different powers of 2 for this variable has shown that η = 256 was a good choice (lowest objective function values on the training set; empirically, this setting also yields the lowest values on the test set), but values of 128 and 512 have given very similar performances.

Our implementation can be used in both the online setting it is intended for, and in a regular batch mode where it uses the entire dataset at each iteration (corresponding to the mini-batch version with η = n). We have also implemented a first-order stochastic gradient descent algorithm that shares most of its code with our algorithm, except for the dictionary update step. This setting allows us to draw meaningful comparisons between our algorithm and its batch and stochastic gradient alternatives, which would have been difficult otherwise. For example, comparing our algorithm to the Matlab implementation of the batch approach from (Lee et al., 2007) developed by its authors would have been unfair since our C++ program has a built-in speed advantage. Although our implementation is multi-threaded, our experiments have been run for simplicity on a single-CPU, single-core 2.4GHz machine. To measure and compare the performances of the three tested methods, we have plotted the value of the objective function on the test set, acting as a surrogate of the expected cost, as a function of the corresponding training time.

Online vs Batch. Figure 1 (top) compares the online and batch settings of our implementation. The full training set consists of 10^6 samples. The online version of our algorithm draws samples from the entire set, and we have run its batch version on the full dataset as well as subsets of size 10^4 and 10^5 (see Figure 1). The online setting systematically outperforms its batch counterpart for every training set size and desired precision. We use a logarithmic scale for the computation time, which shows that in many situations, the difference in performance can be dramatic. Similar experiments have given similar results on smaller datasets.

Comparison with Stochastic Gradient Descent.
Our experiments have shown that obtaining good performance with stochastic gradient descent requires using both the mini-batch heuristic and carefully choosing the learning rate ρ. To give the fairest comparison possible, we have thus optimized these parameters, sampling η values among powers of 2 (as before) and ρ values among powers of 10. The combination of values ρ = 10^4, η = 512 gives the best results on the training and test data for stochastic gradient descent. Figure 1 (bottom) compares our method with stochastic gradient descent for different ρ values around 10^4 and a fixed value of η = 512. We observe that the larger the value of ρ is, the better the eventual value of the objective function is after many iterations, but the longer it will take to achieve a good precision. Although our method performs better at such high-precision settings for dataset C, it appears that, in general, for a desired precision and a particular dataset, it is possible to tune the stochastic gradient descent algorithm to achieve a performance similar to that of our algorithm. Note that both stochastic gradient descent and our method only start decreasing the objective function value after a few iterations. Slightly better results could be obtained by using smaller gradient steps
[Figure 1: six plots, one per evaluation set A, B, and C, for the batch comparison (n = 10^4, 10^5, 10^6) and the stochastic gradient comparison (ρ near 10^4).]

Figure 1. Top: Comparison between online and batch learning for various training set sizes. Bottom: Comparison between our method and stochastic gradient (SG) descent with different learning rates ρ. In both cases, the value of the objective function evaluated on the test set is reported as a function of computation time on a logarithmic scale. Values of the objective function greater than its initial value are truncated.

during the first iterations, using a learning rate of the form ρ/(t + t_0) for the stochastic gradient descent, and initializing A_0 = t_0 I and B_0 = t_0 D_0 for the matrices A_t and B_t, where t_0 is a new parameter.

5.2. Application to Inpainting

Our last experiment demonstrates that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 2. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7 x 10^6 undamaged 12 x 12 color patches in the image with two epochs in about 500 seconds on a 2.4GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008). Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard datasets. Instead, we just wish to demonstrate that the proposed method can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data.
For comparison, the dictionaries used for inpainting in the state-of-the-art method of Mairal et al. (2008) are learned (in batch mode) on only 200,000 patches.

6. Discussion

We have introduced in this paper a new stochastic online algorithm for learning dictionaries adapted to sparse coding tasks, and proven its convergence. Preliminary experiments demonstrate that it is significantly faster than batch alternatives on large datasets that may contain millions of training examples, yet it does not require learning rate tuning like regular stochastic gradient descent methods. More experiments are of course needed to better assess the promise of this approach in image restoration tasks such as denoising, deblurring, and inpainting. Beyond this, we plan to use the proposed learning framework for sparse coding in computationally demanding video restoration tasks (Protter & Elad, 2009), with dynamic datasets whose size is not fixed, and also plan to extend this framework to different loss functions to address discriminative tasks such as image classification (Mairal et al., 2009), which are more sensitive to overfitting than reconstructive ones, and various matrix factorization tasks, such as non-negative matrix factorization with sparseness constraints and sparse principal component analysis.

Acknowledgments

This paper was supported in part by ANR under grant MGA. The work of Guillermo Sapiro is partially supported by ONR, NGA, NSF, ARO, and DARPA.

Figure 2. Inpainting example on a 12-Megapixel image. Top: Damaged and restored images. Bottom: Zooming on the damaged and restored images. (Best seen in color.)

References

Aharon, M., & Elad, M. (2008). Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Journal on Imaging Sciences, 1.
Aharon, M., Elad, M., & Bruckstein, A. M. (2006). The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54.
Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific.
Bickel, P., Ritov, Y., & Tsybakov, A. (2007). Simultaneous analysis of Lasso and Dantzig selector. Preprint.
Bonnans, J., & Shapiro, A. (1998). Optimization problems with perturbation: A guided tour. SIAM Review, 40.
Borwein, J., & Lewis, A. (2006). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online Learning and Neural Networks. Cambridge University Press.
Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20.
Chen, S., Donoho, D., & Saunders, M. (1999). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15.
Fisk, D. (1965). Quasi-martingales. Transactions of the American Mathematical Society, 120.
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1.
Fu, W. (1998). Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7.
Fuchs, J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51.
Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19.
Mairal, J., Elad, M., & Sapiro, G. (2008). Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). Supervised dictionary learning. Advances in Neural Information Processing Systems, 21.
Mallat, S. (1999). A Wavelet Tour of Signal Processing, Second Edition. Academic Press.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37.
Osborne, M., Presnell, B., & Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20.
Protter, M., & Elad, M. (2009). Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18.
Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: Transfer learning from unlabeled data. Proceedings of the International Conference on Machine Learning.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58.
Van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67.
