SUPPLEMENTARY MATERIAL TO GENERAL NON-EXACT ORACLE INEQUALITIES FOR CLASSES WITH A SUBEXPONENTIAL ENVELOPE By Guillaume Lecué CNRS, LAMA, Mare-la-vallée, 77454 Frace ad By Shahar Medelso Departmet of Mathematics, Techio, I.I.T, Haifa 32000, Israel We apply Theorem A to the problem of Covex aggregatio ad show that the optimal rate of Covex aggregatio for o-exact oracle iequalities is much faster tha the optimal rate for exact oracle iequalities. We apply Theorem B to show that regularized procedures based o a uclear orm criterio satisfy oracle iequalities with a residual term that decreases like 1/ for every L q-loss fuctios q 2, while oly assumig that the tail behaviour of the iput ad output variables are well behaved. I particular, a RIP type of assumptio or a icoherece coditio are ot eeded to obtai fast residual terms i this setup. Fially, we relate the problem of Model Selectio to the problem of regularizatio ad apply Theorem B to obtai o-exact oracle iequalities i the Model Selectio setup. 1. Applicatio to the predictio of low-rak matrices. For this applicatio, we observe i.i.d. couples iput/output X i, Y i 1 i where the iput variables X 1,..., X take their values i the space X = M m T of all m T matrices with etries i R ad the output variables Y 1,..., Y are real-valued. Beig give a ew iput X, the goal is to predict the output Y usig a liear fuctio of X whe X, Y is assumed to have the same probability distributio as the X i, Y i s. I this setup, it is ow commo to assume that there are more covariables tha observatios mt >> ad This work is supported by the Frech Agece Natioale de la Recherce ANR ANR Grat Progostic ANR-09-JCJC-0101-01. partially supported by the Mathematical Scieces Istitute The Australia Natioal Uiversity, The Europea Research Coucil uder ERC grat agreemet o [203134] ad the Australia Research Coucil uder grat DP0986563 AMS 2000 subject classificatios: Primary 62G05; secodary 62H30, 68T10. Keywords ad phrases: classificatio, statistical learig, fast rates of covergece, excess risk, aggregatio, margi, complexity of classes of sets, SVM. 1
2 G. LECUÉ AND S. MENDELSON thus more iformatio o the best liear predictio of Y by X is required. A commo assumptio is that Y ca be well predicted by a fuctio of the form X, A 0 = TrX A 0 where A 0 is a m T matrix of low rak. Oce agai we will ot have to make such a assumptio, but it helps to keep this low-dimesioal structure i mid. Ideed, with a small rak ituitio, it is atural to pealize liear estimators X, A by raka. Ufortuately, sice the rak fuctio is ot covex it caot be used i practice as a criterio. A more popular choice is to use a covex relaxatio of the rak fuctio: the S 1 orm Schatte oe orm see [1, 3, 4, 5, 7, 11, 8, 10, 18, 20, 13] ad refereces therei, which is the l 1 -orm of the sigular values of a matrix. Formally, for every A M m T, A S1 = m T i=1 s ia, where s 1 A,..., s m T A are the sigular values of A ad, i geeral for p 1, A Sp = m T i=1 s ia p 1/p. The S 1 -orm was origially used i this type of problems to study exact recostructio see, for example, [7, 19, 6], but other regularizig fuctios have bee used i this cotext e.g. [12, 9, 8] for the predictio ad estimatio problems. I the followig result, we apply Theorem B to obtai o-exact oracle iequalities for a S 1 -based criterio procedure, uder a L q -loss fuctio for some q 2. For every A M m T let R q A = E Y X, A q ad R q A = 1 Y i X i, A q. i=1 Agai, it seems more statistically relevat to assume that Y ad X S2 are almost surely bouded rather tha bouded i ψ q for q > 2, ad the two most iterestig cases are the uiformly bouded oe ad q = 2. We have stated the results uder the more geeral ψ q assumptio to poit out the places i which the decay properties of the fuctios ivolved are really eeded i the hope that it would be possible to improve ad exted the results at a later date, by relaxig the ψ q boudedess assumptio. Theorem D For every q 2 there are costats c 0 ad c 1 depedig oly o q for which the followig holds. Let m ad T as above ad assume ψq that Y ψq, X S2 KmT for some costat KmT which depeds oly o the product mt. Let x > 0 ad 0 < ɛ < 1/2, ad put λ, mt, x = c 0 KmT q log 4q 2/q x + log. Cosider the procedure  Arg mi A M m T R q A + λ, mt, x A q S 1 ɛ 2
NON-EXACT ORACLE INEQUALITIES 3 The, with probability greater tha 1 10 exp x, the L q -risk of  satisfies for η ɛ, mt, x = c 1 KmT q log 4q 2/q x + log R q  if 1 + 2ɛR q A + η, mt, x 1 + A q S 1 A M m T ɛ 2. Sketch of the proof of Theorem D. The proof of Theorem D follows the same lie as the oe of Theorem C. The oly differet igrediet is a etropy estimate that ca be foud i [8] o the complexity of the Schatte S 1 -ball. Propositio 1.1 [8]. There exists a absolute costat c 0 > 0 such ψ2 that the followig holds. Assume that X S2 KmT. The, Eγ 2 2 rbs 1,, 1/2 c0 KmT r log. Oce agai, i the same spirit as i Theorem C, it ca be iterestig to ote that for the quadratic loss q = 2, the resultig estimator is 1  Arg mi A M m T Y i X i, A 2 + λ, mt, x A 2 S 1 ɛ i=1 2 where the regularizig fuctio uses the square of the S 1 -orm ulike the classical estimator for this problem which uses the S 1 orm itself as a regularizig fuctio. The first results i the directio of matrix completio have focused o the exact recostructio of a low-rak matrix A 0 where Y = X, A 0 [3, 4, 7, 19, 10]. The best results [19, 10] to date are that if the umber of measuremets is larger tha raka 0 m + T logm + T ad if the icoherece coditio holds see [7] for more details, the with high probability, a costrait uclear orm miimizatio algorithm ca recostruct A 0 exactly. Predictio results ad statistical estimatio ivolvig low-rak matrices has become a very active field. The most popular methods are based o S 1 -orm pealty fuctios see for istace [1, 2, 3, 4, 18, 20, 8, 12, 13, 20]. To specify some results, fast rates for the oisy matrix completio problem are derived i [20] i the cotext of empirical predictio ad uder a RIP-type assumptio. I [13] the authors prove exact oracle iequalities for the predictio error E X,  A 0 2 whe E Y X = X, A0 ad A0
4 G. LECUÉ AND S. MENDELSON is either of low-rak or of a small S 1 -orm, ad whe the desig of X is kow. I [18], optimal rates for the quadratic risk were obtaied uder a spikiess assumptio o the SVD of A 0, ad i [12], fast covergece rates were derived for based o the vo Neuma etropy pealizatio ad for a kow desig. However, so far oly a few results have bee obtaied for the predictio risk as cosidered here. Probably the closest result i this setup is a exact oracle iequality with slow rates satisfied by a usig a mixture of several orms i [8]. Note that for the two applicatios i Theorem C ad D, we obtai fast covergece rates uder oly tail assumptios o the desig X ad the output Y for every L q -loss for q 2. I particular, oe does ot eed to assume that E Y X is a liear combiatio of the covariables of X, or that Y has ay low-dimesioal structure. If oe happes to be i a lowdimesioal situatio, the residual terms of Theorem C ad D will be small. Hece, the l 1 ad S 1 based procedures used there automatically adapt to this low-dimesioal structure. 2. No-exact oracle iequalities for the Covex aggregatio problem. The problem of Covex aggregatio is the followig: cosider a fiite model F = {f 1,..., f M } for some M N ad try to fid a procedure that is as good as the best covex combiatio of elemets i F. To defie what is meat by as good as, we itroduce some otatio. For ay λ = λ 1,..., λ M R M, let f λ = M j=1 λ j f j ad the covex { hull of F is the set covf = f λ : } M j=1 λ j = 1 ad λ j 0. There are may differet ways of defiig the covex aggregatio problem. The oe that we will be iterested i is the followig: for some 0 ɛ 1/2 costruct a procedure f such that, for ay x > 0 with probability larger tha 1 exp x, 2.1 R f 1 + ɛ if f covf Rf + r M where the residual term r M should be as small as possible. From both mathematical ad statistical poit of view, the most iterestig case to study is for ɛ = 0. I this case, it follows from classical miimax results cf. [21] that o algorithm ca do better tha the rate 2.2 ψ C M = log M/ if M, em/ otherwise. It is show i [21] that there is a procedure f achievig this rate i expectatio: ER f if f covf Rf + ψ C M ad i [14], the ERM is proved
NON-EXACT ORACLE INEQUALITIES 5 to achieve this optimal rate i deviatio we refer to [21] ad [14] for more details. I this setup, we apply Theorem A to obtai iequalities like 2.1 with 0 < ɛ < 1/2 for the ERM over covf : 2.3 f ERM C Arg mi f covf R f. To make the argumet simple, we cosider the bouded regressio framework with respect to the square loss: Y, sup f F fx 1 a.s. ad l f x, y = y fx 2, f F, x, y X R. Theorem E. There exists a absolute costat c 0 such that the followig holds. For ay 0 < x < log ad 0 < ɛ < 1/2, with probability greater tha 1 8 exp x, ERM C R f 1 + 3ɛ if Rf + c 0log Mlog. f covf ɛ Sketch of the proof of Theorem E. The proof of Theorem E ad Theorem C are closely related. I the case of Theorem C, the result follows from the aalysis of the loss fuctios classes idexed for the family of models rb1 d r 0. I the case of Theorem E, the result follows from the aalysis of the loss fuctio class idexed by the model Λ = {λ R M : λ j 0, λ l1 = 1} we idetify every fuctio f λ covf with its parameter λ Λ which is icluded i B1 M. Therefore, the proof of Theorem E follows the same path as the proof of Theorem C but sice we assume boudedess some extra logarithms appearig i Theorem C ca be saved here. Ideed, for the loss fuctios class l covf = {l f : f covf }, we ca apply Theorem A with b l covf = 1, B = 1 ad the isomorphic fuctio ρ x c 0 log Mlog /. Note that the residual term of the o-exact oracle iequality satisfied by f ERM C i Theorem E is of the order of log Mlog / ad is thus uiformly better tha the optimal rate ψ C M for exact oracle iequalities i this setup. Up to logarithms, this residual term ca eve be the square of ψ C M whe M. I this example, the gap betwee the rates obtaied i the o-exact ad exact cases is ot due to a differece i the Berstei coditio, sice it holds i both: for ay f covf, El 2 f El f ad EL 2 f BEL f,
6 G. LECUÉ AND S. MENDELSON where the latter follows from the covexity of covf. Therefore, to explai this gap oe has to look elsewhere, ad it appears that the reaso is complexity. Ideed, it follows from Propositio 2.5 that up to some logarithmic factors, for ay λ > 0, E P P V lcovf λ λ. Therefore, the fixed poit λ ɛ defied i Theorem A ad associated with the loss fuctios class l covf will be of the order of 1/ up to some logarithmic factors. O the other had, if g 1,..., g M are idepedet, stadard Gaussia radom variables, cov {fx : f F } = { M j=1 λ j g j : λ B1 M} ad Y = g M+1, oe ca show that for ay µ 1/M, E P P V LcovF µ 1. Hece, the fixed poit µ of Theorem 6.1 associated with the excess loss class L covf is of the order of 1/ whe M. To coclude, i Covex aggregatio, the complexity of the localized classes V l covf λ ad V L covf µ for all λ > 0 plays a key role i uderstadig the differece betwee exact ad o-exact oracle iequalities. The results of [17] show that this is a geeric situatio, ad that oe should expect a gap betwee the two iequalities eve if the loss ad excess loss classes satisfy a Berstei coditio. 3. Model Selectio ad regularizatio. I this sectio, we obtai results o Model Selectio by applyig Theorem B. The first step is to show that ay procedure is a Model Selectio procedure for a particular class of models M ad some pealty fuctio. The, oe may apply Theorem B ad derived o-exact oracle iequalities for the pealized estimators see [16] for the termiology of this sectio associated with the class M ad the pealty. For the sake of completeess, we will show that the coverse is also true, ad ay pealized estimator is a procedure for some particular class F ad regularizig fuctio. Recall the setup of Model Selectio from [16]. Oe is give a collectio of models, deoted by M, ad a pealty fuctio pe : M R +. For every model m M, a ERM procedure is costructed: 3.1 f m Arg mi f m R f.
NON-EXACT ORACLE INEQUALITIES 7 The a model m is empirically selected by 3.2 m Arg mi R f m + pem. m M The pealized estimator studied i Model Selectio is f m, where, as before, we assume that the ifimum i 3.1 ad i 3.2 are achieved. The ext result shows that the pealized estimator f m is a. Lemma 3.1. Defie a class F ad a regularizig fuctio by 3.3 F = m ad regf = if pem. {m M:f m} m M The the pealized estimator f m satisfies f m Arg mi f F R f + regf Proof. By defiitio of f m, for ay m M ad f m, 3.4 R f m + pe m R f + pem. Therefore, give f F, 3.4 is true for ay m M such that f m. Takig the ifimum over all m M for which f m i the right had side of 3.4, we obtai 3.5 R f m + pe m R f + regf. Sice 3.5 holds for ay f F, thus the claim follows sice f m m ad thus reg f m pe m. It follows from Lemma 3.1 that ay Model Selectio procedure is a procedure over the fuctio class F ad for the regularizig fuctio defied i 3.3. The ext lemma proves the coverse. Lemma 3.2. Let F be a class of fuctio ad reg : F R + be a regularizig fuctio such that for ay f F, regf <. Assume that there exists f F miimizig f R f + regf over F. Deote by regf R + the rage of reg. For ay r regf defie the model m r = {f F : regf r}. If 3.6 M = {m r : r regf} ad pe : m r M r R +, f the is a pealized estimator for the class of models M edowed with the pealty fuctio pe..
8 G. LECUÉ AND S. MENDELSON Proof. Let r = reg f, set m = m r ad for ay m M, let f m be the output of a ERM performed i m. Thus, for ay f m, R f + reg Therefore, R f if f m R f ad is, f = f m. f R f + regf R f + r. f is a ERM over m, that It remais to show that m Arg mi m M R f m + pem, which is evidet because R f m + pe m = R f + reg f = mi R f + regf f F = mi mi R f + regf = mi R f + regf f m r mi r regf f F:regf r r regf mi mi R f + r = mi mi R f + pem r r regf f m r r regf f m r = mi mi R f + pem = mi R f m + pem. m M f m m M With this equivalece i mid, oe ca apply Theorem B to obtai results o procedures, the costruct a class M ad a pealty fuctio accordig to 3.6 ad fially use Lemma 3.2 to obtai oracle iequalities for the pealized estimator f m costructed i this framework. As a example of applicatio, we will use the Model Selectio problem studied i Chapter 8 of [16] for Vapik-Chervoekis models. Cosider the loss fuctio l f x, y = 1I fx y defied for ay x, y X {0, 1} ad measurable fuctio f : X {0, 1}. Oe is give a coutable set M of coutable models that is, a coutable set of measurable fuctios from X to {0, 1} ad ay m M has a fiite VC dimesio deoted by V m. I this setup, Theorem B ca be applied without ay extra assumptio. I particular, we will ot require ay Margi assumptio or Berstei coditio. Let the pealty fuctio be the oe used i pg. 285 of [16], that is, for ay m M, 3.7 pem = 2 2V m 1 + log/v m + log 2 ad recall that the pealized estimator f m satisfies the followig risk boud [16], pg. 285: π 3.8 ER f m if if Rf + pem + m M f m 2.
NON-EXACT ORACLE INEQUALITIES 9 To apply Theorem B, we cosider the followig regularizatio setup defied by F = m ad critf = mi Vm : f m m M m M ad to make thigs simpler assume that 3.9 M = {m k : k N} such that m 0 m 1 m 2. Theorem B allows oe to calibrate the regularizatio term usig the isomorphic profile of the family of loss fuctios classes l Fr r N where for ay r N ote that crit takes its values i N F r = {f F : critf r}. It follows from assumptio 3.9 that F r = m kr where kr = maxk N : V mk r. The isomorphic fuctio associated with the family of models F r r N ca be obtaied i this cotext followig the same strategy used to get Equatio 1.3 i the mai paper of this supplemetary material: we obtai, for ay r N, 3.10 λ ɛr = c 0V mkr log e/v mkr ɛ 2. Moreover, we ca check that b l Fr = 1, B r = 1 for ay r 0 ad that α is a valid choice for the auxiliary fuctio α sice critf, f F. We ca ow apply Theorem B: let 0 < x log ad 0 < ɛ < 1/2 ad f cosider the associated with the regularizig fuctio 3.11 regf = c 1V mf log e/v mf where mf = max m M : V m critf + 1. It follows from Theorem B that with probability greater tha 1 12 exp x, 3.12 R f + c 0 reg ɛ 2 f 1 + 3ɛ if Rf + c 1 regf f F + c 2x + 1. From this result, we ca ow derive a o-exact regularized oracle iequality for the pealized estimator associated with the class of models M ad the pealty fuctio pe as defied i 3.6: 3.13 M = {m r : r regf} ad pe m r = r, r regf where regf = {r 0,..., r N }, N = maxk N : V mk ad r i = c 1V m i log e/v m i ɛ 2, 0 i N,
10 G. LECUÉ AND S. MENDELSON for m N = m N, m i = max m M : V m < V m i+1, 0 i N 1. I other words, M is a the largest subset of M of models with strictly icreasig VC dimesio smaller tha ad with the largest possible models for each VC dimesio. Each oe of these models of VC dimesio V is the pealized by c 1 V loge/v /ɛ 2. We ca ow state a result for the pealized estimator associated with the class M ad the pealty fuctio pe. Theorem F. There exists some absolute costats c 1, c 2, c 3 ad c 4 such that the followig holds. Let M = {m 0,, m N } be a family of models such that m 0 m N ad V m0 < V m1 < < V mn where for ay m M, V m is the VC dimesio of m. Let 0 < ɛ < 1/2. Cosider the pealty fuctio pe : F R + defied by pem = c 1 V m log e/v m /ɛ 2. The the pealized estimator f m is such that for ay 0 < x log, with probability greater tha 1 12 exp x, R f m + c 2 pe m 1 + 3ɛ mi if Rf + c 3pem. m M f m I particular, it is iterestig to ote that, up to a logarithmic factor, the pealty fuctio i 3.7 is of the order of V/ whereas, i the same framework up to the structural assumptio 3.9, the pealty fuctio defied i Theorem F is of the order of V/. This differece comes from the Berstei coditios sice for the loss fuctios class: El 2 f BEl f, f F is trivially satisfied i the setup of Theorem F; whereas the Berstei coditio for the excess loss fuctios classes: EL 2 f,m BEL f,m, f m, m M where L f,m = l f l f m ad fm argmi f m Rf is ot true i geeral ad is somehow required to obtai fast rates for exact oracle iequalities. Fially, ote that a direct approach based o the computatio of isomorphic fuctios for the family of loss fuctios class l m m M would provide a way of costructig pealty fuctios ad obtaiig oracle iequalities for the pealized estimator associated with this pealty fuctio. This approach do ot require the structural assumptio 3.9. Nevertheless, we did ot prove such a result which would maily follow the same lie as the proof of Theorem B combied with the approach i [16] cf. Sectio 3.6 i [15] for such a result. Somehow, we foud more iterestig to prove that Model Selectio methods ca be see as regularized procedures ad that Theorem B, which was origially desiged for regularized estimators, ca also be used to prove results for pealized estimators. Refereces.
NON-EXACT ORACLE INEQUALITIES 11 [1] Fracis R. Bach. Cosistecy of trace orm miimizatio. J. Mach. Lear. Res., 9:1019 1048, 2008. [2] Floretia Buea, Yiyua She, ad Marte Wegkamp. Optimal selectio of reduced rak estimators of high-dimesioal matrices. arxiv:1004.2995v2, 2010. [3] Emmauel J. Cades ad Yaiv Pla. Matrix completio with oise. Proceedigs of IEEE, 2009. [4] Emmauel J. Cades ad Yaiv Pla. Tight oracle bouds for low-rak matrix recovery from a miimal umber of radom measuremets. arxiv:1001.0339. [5] Emmauel J. Cades ad Bejami Recht. Exact matrix completio via covex optimizatio. Foud. of Comput. Math., 9:717 772, 2008. [6] Emmauel J. Cades ad Bejami Recht. Exact matrix completio via covex optimizatio. Foud. Comput. Math., 96:717 772, 2009. [7] Emmauel J. Cades ad Terece Tao. The power of covex relaxatio: Near-optimal matrix completio. arxiv:0903.1476. [8] Stéphae Gaïffas ad Guillaume Lecué. Sharp oracle iequalities for highdimesioal matrix predictio. To appear i IEEE trasactio o iformatio theory. arxiv:1008.4886. [9] Christophe Giraud. Low rak multivariate regressio. arxiv:1009.5165. [10] David Gross. Recoverig low-rak matrices from few coefficiets i ay basis. ArXiv 0910.1879, 2009. [11] Raghuada Keshava, Adrea H., Motaari, ad Sewoog Oh. Matrix completio from oisy etries. arxiv:0906.2027, 2009. [12] Vladimir Koltchiskii. Vo euma etropy pealizatio ad low rak matrix estimatio. arxiv:1009.2439. [13] Vladimir Koltchiskii, Alexadre B. Tsybakov, ad Karim Louici. Nuclear orm pealizatio ad optimal rates for oisy low rak matrix completio. arxiv:1011.6256. [14] Guillaume Lecué. Empirical risk miimizatio is optimal for the covex aggregatio problem. 2011. [15] Guillaume Lecué. Iterplay betwee cocetratio, complexity ad geometry i learig theory with applicatios to high dimesioal data aalysis. Habilitatio à diriger des recherches. 2011. [16] Pascal Massart. Cocetratio iequalities ad model selectio, volume 1896 of Lecture Notes i Mathematics. Spriger, Berli, 2007. Lectures from the 33rd Summer School o Probability Theory held i Sait-Flour, July 6 23, 2003, With a foreword by Jea Picard. [17] Shahar Medelso. Oracle iequalities ad the isomorphic method. Uder preparatio. [18] Sahad Negahba ad Marti J. Waiwright. Restricted strog covexity ad weighted matrix completio: Optimal bouds with oise. arxiv:1009.2118. [19] Bejami Recht. A simpler approach to matrix completio. arxiv:0910.0651. [20] Agelika Rohde ad Alexadre Tsybakov. Estimatio of high-dimesioal low-rak matrice. To appear i A. Statist.. arxiv:0912.5338. [21] Alexadre Tsybakov. Optimal rate of aggregatio. I Computatioal Learig Theory ad Kerel Machies COLT-2003, volume 2777 of Lecture Notes i Artificial Itelligece, pages 303 313. Spriger, Heidelberg, 2003. E-mail: Guillaume.Lecue@uiv-mlv.fr E-mail: shahar@tx.techio.ac.il