Consistency and Convergence Rates of OneClass SVMs and Related Algorithms


 Aleesha Ellis
 1 years ago
 Views:
Transcription
1 Journal of Machine Learning Research Submitted 9/05; Revised 1/06; Published 5/06 Consistency and Convergence Rates of OneClass SVMs and Related Algorithms Régis Vert Laboratoire de Recherche en Informatique Bâtiment 490, Université ParisSud 91405, Orsay Cedex, France, and Masagroup 24 Boulevard de l Hôpital Paris, France JeanPhilippe Vert Center for Computational Biology Ecole des Mines de Paris 35 rue Saint Honoré 77300, Fontainebleau, France Editor: Bernhard Schölkopf Abstract We determine the asymptotic behaviour of the function computed by support vector machines SVM and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Nonasymptotic convergence bounds to this limit in the L 2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayesconsistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the oneclass SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, oneclass SVM, convex loss functions, kernel density estimation 1. Introduction Given n independent and identically distributed i.i.d. copies X 1,Y 1,...,X n,y n of a random variable X,Y R d { 1,1}, we study in this paper the limit and consistency of learning algorithms that solve the following problem: arg min f H σ { 1 n n i=1 φy i fx i +λ f 2 H σ }, 1 c 2006 Régis Vert and JeanPhilippe Vert.
2 VERT AND VERT where φ : R R is a convex loss function andh σ is the reproducing kernel Hilbert space RKHS of the normalized Gaussian radial basis function kernel denoted simply Gaussian kernel below: k σ x,x 1 x x 2 := d exp 2πσ 2σ 2, σ > 0. 2 This framework encompasses in particular the classical support vector machine SVM Boser et al., 1992 when φu = max1 u, 0 Theorem 6. Recent years have witnessed important theoretical advances aimed at understanding the behavior of such regularized algorithms when n tends to infinity and λ decreases to 0. In particular the consistency and convergence rates of the twoclass SVM see, e.g., Steinwart, 2002; Zhang, 2004; Steinwart and Scovel, 2004, and references therein have been studied in detail, as well as the shape of the asymptotic decision function Steinwart, 2003; Bartlett and Tewari, The case of more general convex loss functions has also attracted a lot of attention recently Zhang, 2004; Lugosi and Vayatis, 2004; Bartlett et al., 2006, and been shown to provide under general assumptions consistent procedure for the classification error. All results published so far, however, study the case where λ decreases as the number of points tends to infinity or, equivalently, where λσ d converges to 0 if one uses the classical nonnormalized version of the Gaussian kernel instead of 2. Although it seems natural to reduce regularization as more and more training data are available even more than natural, it is the spirit of regularization Tikhonov and Arsenin, 1977; Silverman, 1982, there is at least one important situation where λ is typically held fixed: the oneclass SVM Schölkopf et al., In that case, the goal is to estimate an αquantile, that is, a subset of R d of given probability α with minimum volume. The estimation is performed by thresholding the function output by the oneclass SVM, that is, the SVM 1 with only positive examples; in that case λ is supposed to determine the quantile level. 1 Although it is known that the fraction of examples in the selected region converges to the desired quantile level α Schölkopf et al., 2001, it is still an open question whether the region converges towards a quantile, that is, a region of minimum volume. Besides, most theoretical results about the consistency and convergence rates of twoclass SVM with vanishing regularization constant do not translate to the oneclass case, as we are precisely in the seldom situation where the SVM is used with a regularization term that does not vanish as the sample size increases. The main contribution of this paper is to show that Bayes consistency for the classification error can be obtained for algorithms that solve 1 without decreasing λ, if instead the bandwidth σ of the Gaussian kernel decreases at a suitable rate. We prove upper bounds on the convergence rate of the classification error towards the Bayes risk for a variety of functions φ and of distributions P, in particular for SVM Theorem 6. Moreover, we provide an explicit description of the function asymptotically output by the algorithms, and establish converge rates towards this limit for the L 2 norm Theorem 7. In particular, we show that the decision function output by the oneclass SVM converges towards the density to be estimated, truncated at the level 2λ Theorem 8; we finally show Theorem 9 that this implies the consistency of oneclass SVM as a density level set estimator for the excessmass functional Hartigan, This paper is organized as follows. In Section 2, we set the framework of this study and state the main results. The rest of the paper is devoted to the proofs and some extensions of these results. In Section 3, we provide a number of known and new properties of the Gaussian RKHS. Section 4 1. While the original formulation of the oneclass SVM involves a parameter ν, there is asymptotically a onetoone correspondence between λ and ν. 818
3 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS is devoted to the proof of the main theorem that describes the speed of convergence of the regularized φrisk of its empirical minimizer towards its minimum. This proof involves in particular a control of the sample error in this particular setting that is dealt with in Section 5. Section 6 relates the minimization of the regularized φrisk to more classical measures of performance, in particular classification error and L 2 distance to the limit. These results are discussed in more detail in Section 7 for the case of the 1 and 2SVM. Finally the proof of the consistency of the oneclass SVM as a density level set estimator is postponed to Section Notation and Main Results Let X,Y be a pair of random variables taking values in R d { 1,1}, with distribution P. We assume throughout this paper that the marginal distribution of X has a density ρ : R d R with respect to the Lebesgue measure, and that its support is included in a compact setx R d. Let η : R d [0,1] denote a measurable version of the conditional distribution of Y = 1 given X. The function 2η 1 then corresponds to the socalled regression function. The normalized Gaussian radial basis function RBF kernel k σ with bandwidth parameter σ > 0 is defined for any x,x R d R d by: 2 k σ x,x 1 x x 2 := d exp 2πσ 2σ 2, the corresponding reproducing kernel Hilbert space RKHS is denoted byh σ, with associated norm. H σ. Moreover let κ σ := k σ L = 1/ 2πσ d. 3 Several useful properties of this kernel and its RKHS are gathered in Section 3. Denoting bym the set of measurable realvalued functions on R d, we define several risks for functions f M : The classification error rate, usually ref R f := Psign fx Y, and the minimum achievable classification error rate overm is called the Bayes risk: R := inf R f. f M For a scalar λ > 0 fixed throughout this paper and a convex function φ : R R, the φrisk regularized by the RKHS norm is defined, for any σ > 0 and f H σ, by R φ,σ f := E P [φy f X]+λ f 2 H σ, and the minimum achievable R φ,σ risk overh σ is denoted by R φ,σ := inf f H σ R φ,σ f. 2. We refer the reader to Section 3.2 for a brief discussion on the relation between normalized/unnormalized Gaussian kernel. 819
4 VERT AND VERT Furthermore, for any real r 0, we know that φ is Lipschitz on [ r,r], and we denote by Lr the Lipschitz constant of the restriction of φ to the interval [ r, r]. For example, for the hinge loss φu = max0,1 u one can take Lr = 1, and for the squared hinge loss φu = max0,1 u 2 one can take Lr = 2r+ 1. Finally, the L 2 norm regularized φrisk is, for any f M : where, R φ,0 f := E P [φy f X]+λ f 2 L 2 Z f 2 L 2 := fx 2 dx [0,+ ], R d and the minimum achievable R φ,0 risk overm is denoted by R φ,0 := inf f M R φ,0f < As we shall see in the sequel, the above notation is consistent with the fact that R φ,0 is the pointwise limit of R φ,σ as σ tends to zero. Each of these risks has an empirical counterpart where the expectation with respect to P is replaced by an average over an i.i.d. sample T := {X 1,Y 1,...,X n,y n }. In particular, the following empirical version of R φ,σ will be used σ > 0, f H σ, R φ,σ f := 1 n n i=1 φy i f X i +λ f 2 H σ Furthermore, ˆf φ,σ denotes the minimizer of R φ,σ overh σ see Steinwart, 2005a, for a proof of existence and uniqueness of ˆf φ,σ. The main focus of this paper is the analysis of learning algorithms that minimize the empirical φrisk regularized by the RKHS norm R φ,σ, and their limit as the number of points tends to infinity and the kernel width σ decreases to 0 at a suitable rate when n tends to, λ being kept fixed. Roughly speaking, our main result shows that in this situation, the minimization of R φ,σ asymptotically amounts to minimizing R φ,0. This stems from the fact that the empirical average term in the definition of R φ,σ converges to its corresponding expectation, while the norm inh σ of a function f decreases to its L 2 norm when σ decreases to zero. To turn this intuition into a rigorous statement, we need a few more assumptions about the minimizer of R φ,0 and about P. First, we observe that the minimizer of R φ,0 is indeed welldefined and can often be explicitly computed the following lemma is part of Theorem 26 and is proved in Section 6.3: Lemma 1 Minimizer of R φ,0 For any x R d, let { f φ,0 x := argmin ρx[ηxφα+1 ηxφ α]+λα 2 }. α R Then f φ,0 is measurable and satisfies: R φ,0 fφ,0 = inf f M R φ,0f 820
5 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS Second, let us recall the notion of modulus of continuity DeVore and Lorentz, 1993, p.44: Definition 2 Modulus of Continuity Let f be a Lebesgue measurable function from R d to R. Then its modulus of continuity in the L 1 norm is defined for any δ 0 as follows where t is the Euclidean norm of t R d. ω f,δ := sup f.+t f. L1, 4 0 t δ The main result of this paper, whose proof is postponed to Section 4, can now be stated as follows: Theorem 3 Main Result Let σ 1 > σ > 0, 0 < p < 2, δ > 0, and let ˆf φ,σ denote the minimizer of the R φ,σ risk overh σ, where φ is assumed to be convex. Assume that the marginal density ρ is bounded, and let M := sup x R d ρx. Then there exist constants K i i=1...4 depending only on p, δ, λ, d, and M such that the following holds with probability greater than 1 e x over the draw of the training data R φ,0 ˆf φ,σ R κσ φ0 φ,0 K 1 L λ 21 κσ φ0 + K 2 L λ σ + K 3 σ 2 σ K 4 ω f φ,0,σ 1, 4 2+p 1 σ[2+2 p1+δ]d 2+p d x n p n 5 where Lr still denotes the Lipschitz constant of φ on the interval [ r,r], for any r > 0. The first two terms in r.h.s. of 5 bound the estimation error also called sample error associated with the Gaussian RKHS, which naturally tends to be small when the number of training data increases and when the RKHS is small, i.e., when σ is large. As is usually the case in such variance/bias splittings, the variance term here depends on the dimension d of the input space. Note that it is also parametrized by both p and δ. These two parameters come from the bound 36 on covering numbers that is used to derive the estimation error bound 31. Both constants K 1 and K 2 depend on them, although we do not know the explicit dependency. The third term measures the error due to penalizing the L 2 norm of a fixed function inh σ1 by its. H σ norm, with 0 < σ < σ 1. This is a price to pay to get a small estimation error. As for the fourth term, it is a bound on the approximation error of the Gaussian RKHS. Note that, once λ and σ have been fixed, σ 1 remains a free variable parameterizing the bound itself. From 5, we can deduce the R φ,0 consistency of ˆf φ,σ for Lipschitz loss functions, as soon as f φ,0 is integrable, σ = o n 1/d+ε for some ε > 0, and σ 1 0 with σ/σ 1 0. Now, in order to highlight the type of convergence rates one can obtain from Theorem 3, let us assume that the φ loss function is Lipschitz on R e.g., take the hinge loss, and suppose that for some 0 β 1, c 1 > 0, 821
6 VERT AND VERT and for any h 0, f φ,0 satisfies the following inequality: 3 ω f φ,0,δ c 1 δ β. 6 Then the righthand side of 5 can be optimized w.r.t. σ 1, σ, p and δ by balancing the first, third and fourth terms the second term having always a better convergence rate than the first one. For any ε > 0, by choosing: δ = 1, ε p = 2 2d + dβ β, 2+β 1 4β+2+βd+ε σ =, n σ 1 = σ 2 2+β = 2 1 4β+2+βd+ε n the following rate of convergence is obtained: 2β R φ,0 ˆf φ,σ R φ,0 = O P 1 4β+2+βd+ε. n This shows in particular that, whatever the values of β and d, the convergence rate that can be derived from Theorem 3 is always slower than 1/ n, and it gets slower and slower as the dimension d increases. Theorem 3 shows that, when φ is convex, minimizing the R φ,σ risk for wellchosen width σ is a an algorithm consistent for the R φ,0 risk. In order to relate this consistency with more traditional measures of performance of learning algorithms, the next theorem shows that under a simple additional condition on φ, R φ,0 riskconsistency implies Bayes consistency: Theorem 4 Relating R φ,0 Consistency with Bayes Consistency If φ is convex, differentiable at 0, with φ 0 < 0, then for every sequence of functions f i i 1 M, lim R φ,0f i = R φ,0 = i + lim R f i = R i + This theorem results from a more general quantitative analysis of the relationship between the excess R φ,0 risk and the excess Rrisk Theorem 28, and is proved in Section 6.5. In order to state a refined version of it in the particular case of the support vector machine algorithm, we first need to introduce the notion of low density exponent: Definition 5 We say that a distribution P with marginal density ρ w.r.t. Lebesgue measure has a low density exponent γ 0 if there exists c 2,ε 0 0, 2 such that { } ε [0,ε 0 ], P x R d : ρx ε c 2 ε γ. 3. For instance, it can be shown that the indicator function of the unit ball in R d, albeit not continuous, satisfies 6 with β = 1., 822
7 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS We are now in position to state a quantitative relationship between the excess R φ,0 risk and the excess Rrisk in the case of support vector machines : Theorem 6 Consistency of SVM Let φ 1 α := max1 α,0 be the hinge loss function and let φ 2 α := max1 α,0 2 be the squared hinge loss function. Then for any distribution P with low density exponent γ, there exist constant K 1,K 2,r 1,r 2 0, 4 such that for any f M with an excess R φ1,0risk upper bounded by r 1 the following holds: R f R K 1 Rφ1,0 f R γ 2γ+1 φ 1,0, and if the excess regularized R φ2,0risk upper bounded by r 2 the following holds: R f R K 2 Rφ2,2 f R γ 2γ+1 φ 2,2, This theorem is proved in Section 7.3. In combination with Theorem 3, it states the consistency of SVM, and gives upper bounds on the convergence rates, for the first time in a situation where the effect of regularization does not vanish asymptotically. In fact, Theorem 6 is a particular case of a more general result Theorem 29 valid for a large class of convex loss functions. Section 6 is devoted to the analysis of the general case through the introduction of variational arguments, in the spirit of Bartlett et al Another consequence of the R φ,0 consistency of an algorithm is the L 2 convergence of the function output by the algorithm to the minimizer of the R φ,0 risk : Lemma 7 Relating R φ,0 Consistency with L 2 Consistency For any f M, the following holds: f f φ,0 2 L 2 1 λ Rφ,0 f R φ,0. This result is the third statement of Theorem 26, proved in Section 6.3. It is particularly relevant to study algorithms whose objective is not binary classification. Consider for example the oneclass SVM algorithm, which served as the initial motivation for this paper. Then we can state the following result, proved in Section 8.1 : Theorem 8 L 2 Consistency of OneClass SVM Let ρ λ denote the function obtained after truncating the density: { ρx ρ λ x := 2λ if ρx 2λ, 7 1 otherwise. Let ˆf σ denote the function output by the oneclass SVM: 1 ˆf σ := arg min f H σ n n i=1 φ f X i +λ f 2 H σ. Then, under the general conditions of Theorem 3, and assuming that lim h 0 ωρ λ,h = 0, for a wellcalibrated bandwidth σ. lim ˆf σ ρ λ L2 = 0, in probability, n + 823
8 VERT AND VERT In this and the next theorem, wellcalibrated refers to any choice of bandwidth σ that ensures R φ,0  consistency, as discussed after Theorem 3. A very interesting byproduct of this theorem is the consistency of the oneclass SVM algorithm for density level set estimation, which to the best of our knowledge has not been stated so far the proof being postponed to Section 8.2 : Theorem 9 Consistency of OneClass SVM for Density Level Set Estimation Let 0 < µ< 2λ< M, let C µ be the level set of the density function ρ at level µ: { } C µ := x R d : ρx µ, and Ĉ µ be the level set of 2λ ˆf σ at level µ: { Ĉ µ := x R d : ˆf σ x µ }, 2λ where ˆf σ is still the function output by the oneclass SVM. For any distribution Q, for any subset C of R d, define the excessmass of C with respect to Q as follows: H Q C := QC µlebc. Then, under the general assumptions of Theorem 3, and assuming that lim h 0 ωρ λ,h = 0, we have lim H PC µ H P Ĉµ = 0, in probability, n + for a wellcalibrated bandwidth σ. The excessmass functional was first introduced by Hartigan 1987 to assess the quality of density level set estimators. It is maximized by the true density level set C µ and acts as a risk functional in the oneclass framework. The proof of Theorem 9 is based on the following general result: if ˆρ is a density estimator converging to the true density ρ in the L 2 sense, then for any fixed 0 < µ < sup R d {ρ}, the excess mass of the level set of ˆρ at level µ converges to the excess mass of C µ. In other words, as is the case in the classification framework, plugin rules built on L 2  consistent density estimators are consistent with respect to the excess mass. 3. Some Properties of the Gaussian Kernel and its RKHS This section presents known and new results about the Gaussian kernel k σ and its associated RKHSH σ, that are useful for the proofs of our results. They concern the explicit description of the RKHS norm in terms of Fourier transforms, its relation with the L 2 norm, and some approximation properties of convolutions with the Gaussian kernel. They make use of basic properties of Fourier transforms which we now recall and which can be found in e.g. Folland, 1992, Chap. 7, p.205. For any f in L 1 R d, its Fourier transformf [ f] : R d R is defined by Z F [ f]ω := e i<x,ω> fxdx. R d If in additionf [ f] L 1 R d, f can be recovered fromf [ f] by the inverse Fourier formula: fx = 1 Z F 2π d [ f]ωe i<x,ω> dω. R d 824
9 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS Finally Parseval s equality relates the L 2 norm of a function and its Fourier transform if f L 1 R d L 2 R d andf [ f] L 2 R d : f 2 L 2 = 1 2π d F [ f] 2 L Fourier Representation of the Gaussian RKHS For any u R d, the expression k σ u denotes k σ 0,u, with Fourier transform known to be: F [k σ ]ω = e σ2 ω The general study of translation invariant kernels provides an accurate characterization of their associated RKHS in terms of the their Fourier transform see, e.g., Matache and Matache, In the case of the Gaussian kernel, the following holds : Lemma 10 Characterization ofh σ LetC 0 R d denote the set of continuous functions on R d that vanish at infinity. The set { } ZRd H σ := f C 0 R d : f L 1 R d and F [ f]ω 2 e σ2 ω 2 2 dω < 10 is the RKHS associated with the Gaussian kernel k σ, and the associated dot product is given for any f,g H σ by f,g H σ = 1 ZRd F 2π d [ f]ωf [g]ω e σ2 ω 2 2 dω, 11 where a denotes the conjugate of a complex number a. In particular the associated norm is given for any f H σ by f 2 H σ = 1 ZRd F 2π d [ f]ω 2 e σ2 ω 2 2 dω. 12 This lemma readily implies several basic facts about Gaussian RKHS and their associated norms summarized in the next lemma. In particular, it shows that the family H σ σ>0 forms a nested collection of models, and that for any fixed function, the RKHS norm decreases to the L 2 norm as the kernel bandwidth decreases to 0. Lemma 11 The following statements hold: 1. For any 0 < σ < τ, Moreover, for any f H τ, and 2. For any τ > 0 and f H τ, H τ H σ L 2 R d. 13 f H τ f H σ f L f 2 H σ f 2 L 2 σ2 τ 2 f 2 H τ f 2 L lim f H σ 0 σ = f L
10 VERT AND VERT 3. For any σ > 0 and f H σ, f L κ σ f H σ. 17 Proof Equations 13 and 14 are direct consequences of the characterization of the Gaussian RKHS 12 and of the observation that 0 < σ < τ = e τ2 ω 2 2 e σ2 ω In order to prove 15, we derive from 12 and Parseval s equality 8: f 2 H σ f 2 L 2 = 1 Z [ ] F 2π d [ f]ω 2 e σ2 ω dω. 18 R d For any 0 u v, we have e u 1/u e v 1/v by convexity of e u, and therefore: f 2 H σ f 2 L 2 1 σ 2 Z [ ] F 2π d τ 2 [ f]ω 2 e τ2 ω dω, 19 R d which leads to 15. Equation 16 is now a direct consequence of 15. Finally, 17 is a classical bound derived from the observation that, for any x R d, fx = f,k σ H σ f H σ k σ H σ = κ σ f H σ. 3.2 Links with the NonNormalized Gaussian Kernel It is common in the machine learning literature to work with a nonnormalized version of the Gaussian RBF kernel, namely the kernel: x x k σ x,x 2 := exp 2σ From the relation k σ = κ σ k σ remember that κ σ is defined in Equation 3, we deduce from the general theory of RKHS thath σ = H σ and f H σ, f H σ = κ σ f H σ. 21 As a result, all statements about k σ and its RKHS easily translate into statements about k σ and its RKHS. For example, 14 shows that, for any 0 < σ < τ and f H τ, κτ σ d 2 f H τ f κ H σ σ = f τ H σ, a result that was shown recently Steinwart et al., 2004, Corollary
11 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS 3.3 Convolution with the Gaussian Kernel Besides its positive definiteness, the Gaussian kernel is commonly used as a kernel for function approximation through convolution. Recall Folland, 1992 that the convolution between two functions f,g L 1 R d is the function f g L 1 R d defined by Z f gx := f x ugudu R d and that it satisfies see e.g. Folland, 1992, Chap. 7, p.207 F [ f g] =F [ f]f [g]. 22 The convolution with a Gaussian RBF kernel is a technically convenient tool to map any square integrable function to a Gaussian RKHS. The following lemma which can also be found in Steinwart et al., 2004 gives several interesting properties on the RKHS and L 2 norms of functions smoothed by convolution: Lemma 12 For any σ > 0 and any f L 1 R d L 2 R d, k σ f H 2σ and k σ f H 2σ = f L2. 23 For any σ,τ > 0 that satisfy 0 < σ 2τ, and for any f L 1 R d L 2 R d, and k τ f H σ k τ f 2 H σ k τ f 2 L 2 σ2 2τ 2 f 2 L Proof Using 12, then 22 and 9, followed by Parseval s equality 8, we compute: k σ f 2 H 2σ = 1 Z F 2π d [k σ f]ω 2 e R d = 1 Z F 2π d [ f]ω 2 e R d = 1 Z F 2π d [ f]ω 2 dω R d = f 2 L 2. σ2 ω 2 σ2 ω 2 This proves the first two statements of the lemma. Now, because 0 < σ 2τ, previous first statement and 13 imply k τ f H 2τ H σ, dω e σ2 ω 2 dω 827
12 VERT AND VERT and, using 15 and 23, k τ f 2 H σ k τ f 2 L 2 σ2 k 2τ 2 τ f 2 H k τ f 2 L 2τ 2 σ2 2τ 2 k τ f 2 H 2τ = σ2 2τ 2 f 2 L 2. A final result we need is an estimate of the approximation properties of convolution with the Gaussian kernel. Convolving a function with a Gaussian kernel with decreasing bandwidth is known to provide an approximation of the original function under general conditions. For example, the assumption f L 1 R d is sufficient to show that k σ f f L1 goes to zero when σ goes to zero see, for example Devroye and Lugosi, 2000, page 79. We provide below a more quantitative estimate for the rate of this convergence under some assumption on the modulus of continuity of f see Definition 2, using methods from DeVore and Lorentz 1993, Chap.7, par.2, p.202. Lemma 13 Let f be a bounded function in L 1 R d. Then for all σ > 0, the following holds: k σ f f L1 1+ dω f,σ, where ω f,. denotes the modulus of continuity of f in the L 1 norm. Proof Using the fact that k σ is normalized, then Fubini s theorem and then the definition of ω, the following can be derived k σ f f L1 = = Z Z k σ t[ fx+t fx]dt dx R d R Z Z d k σ t fx+t fx dtdx R d R Z d [ Z ] k σ t fx+t fx dx dt R d R Z d k σ t f.+t f. L1 dt R Z d k σ tω f, t dt. R d Now, using the following subadditivity property of ω DeVore and Lorentz, 1993, Chap.2, par.7, p.44: ω f,δ 1 + δ 2 ω f,δ 1 +ω f,δ 2, δ 1,δ 2 > 0, the following inequality can be derived for any positive λ and δ: ω f,λδ 1+λω f,δ. 828
13 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS Applying this and also CauchySchwarz inequality leads to Z k σ f f L1 1+ t ω f,σk σ tdt R d σ [ = ω f,σ 1+ 1 Z ] t k σ tdt σ R [ d ω f,σ 1+ 1 Z t 2 1 σ R d t 2 1] 2 2πσ d e 2σ 2 dt d 1 = ω f,σ 1+ 1 Z t 2 1 σ i=1 R d i t 2 2 2πσ d e 2σ 2 dt d 1 = ω f,σ 1+ 1 Z t 2 1 σ i=1 R d i e t2 2 i 2σ 2 dt i 2πσ [ = ω f,σ 1+ 1 Z 1] d u 2 1 e u2 2 2σ σ 2 du. Rd 2πσ The integral term is exactly the variance of a Gaussian random variable, namely σ 2. Hence we end up with k σ f f L1 1+ dω f,σ. 4. Proof of Theorem 3 The proof of Theorem 3 is based on the following decomposition of the excess R φ,0 risk for the minimizer of the R φ,σ risk: Lemma 14 For any 0 < σ < 2σ 1 and any sample X i,y i i=1,...,n, the minimizer ˆf φ,σ of R φ,σ satisfies: R φ,0 ˆf φ,σ R φ,0 [ R φ,σ ˆf φ,σ R ] φ,σ + [ R φ,σ k σ1 f φ,0 R φ,0 k σ1 f φ,0 ] + [ 25 R φ,0 k σ1 f φ,0 R ] φ,0 Proof The excess R φ,0 risk decomposes as follows: R φ,0 ˆf φ,σ R φ,0 = [ ] R φ,0 ˆf φ,σ Rφ,σ ˆf φ,σ + [ R φ,σ ˆf φ,σ R ] φ,σ + [ R φ,σ R φ,σ k σ1 f φ,0 ] + [ R φ,σ k σ1 f φ,0 R φ,0 k σ1 f φ,0 ] + [ R φ,0 k σ1 f φ,0 R φ,0]. 829
14 VERT AND VERT Note that by Lemma 12, k σ1 f φ,0 H 2σ 1 H σ L 2 R d which justifies the introduction of R φ,σ k σ1 f φ,0 and R φ,0 k σ1 f φ,0. Now, by definition of the different risks using 14, we have R φ,0 ˆf φ,σ Rφ,σ ˆf φ,σ = λ ˆf φ,σ 2 L 2 ˆf φ,σ 2 H σ 0, and R φ,σ R φ,σ k σ1 f φ,0 0. Hence, controlling R φ,0 ˆf φ,σ R φ,0 to prove Theorem 3 boils down to controlling each of the three terms arising in 25, which can be done as follows: The first term in 25 is usually referred to as the sample error or estimation error. The control of such quantities has been the topic of much research recently, including for example Tsybakov 1997; Mammen and Tsybakov 1999; Massart 2000; Bartlett et al. 2005; Koltchinskii 2003; Steinwart and Scovel Using estimates of local Rademacher complexities through covering numbers for the Gaussian RKHS due to Steinwart and Scovel 2004, we prove below the following result Lemma 15 For any σ > 0 small enough, let ˆf φ,σ be the minimizer of the R φ,σ risk on a sample of size n, where φ is a convex loss function. For any 0 < p < 2, δ > 0, and x 1, the following holds with probability at least 1 e x over the draw of the sample: R φ,σ ˆf φ,σ R κσ φ0 φ,σ K 1 L λ 21 κσ φ0 + K 2 L λ σ 4 2+p 1 σ[2+2 p1+δ]d 2+p d x n, where K 1 and K 2 are positive constants depending neither on σ, nor on n. The second term in 25 can be upper bounded by p n φ0σ 2 2σ 2 1. Indeed, using Lemma 12, and the fact that σ < 2σ 1, we have ] R φ,σ k σ1 f φ,0 R φ,0 k σ1 f φ,0 = λ [ k σ1 f φ,0 2 H σ k σ1 f φ,0 2 L2 λσ2 2σ 2 f φ,0 2 L 2. 1 Since f φ,0 minimizes R φ,0, we have R φ,0 fφ,0 Rφ,0 0, which leads to f φ,0 2 L 2 φ0/λ. Therefore, we have R φ,σ k σ1 f φ,0 R φ,0 k σ1 f φ,0 φ0σ2 2σ
15 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS The third term in 25 can be upper bounded by Indeed, 2λ f φ,0 L + L f φ,0 L M 1+ d ω f φ,0,σ 1. R φ,0 k σ1 f φ,0 R φ,0 f φ,0 = λ [ k σ1 f φ,0 2 L 2 f φ,0 2 L 2 ] + [ EP [ φ Ykσ1 f φ,0 X ] E P [ φ Y fφ,0 X ]] = λ k σ1 f φ,0 f φ,0,k σ1 f φ,0 + f φ,0 L 2 +E P [ φ Ykσ1 f φ,0 X φ Y f φ,0 X ]. Now, since k σ1 f φ,0 L f φ,0 L k σ1 L1 = f φ,0 L, then using Lemma 13, we obtain: R φ,0 k σ1 f φ,0 R φ,0 f φ,0 where M := sup x R d ρx is supposed to be finite. 2λ f φ,0 L k σ1 f φ,0 f φ,0 L1 +L f φ,0 L EP [ kσ1 f φ,0 X f φ,0 X ] 2λ f φ,0 L + L f φ,0 L M kσ1 f φ,0 f φ,0 L1 2λ f φ,0 L + L f φ,0 L M 1+ d ω f φ,0,σ 1, Now, Theorem 3 is proved by plugging the last three bounds in Proof of Lemma 15 Sample Error The present section is divided into two subsections: the first one presents the proof of Lemma 15, and the auxiliary lemmas that are used in it are then proved in the second subsection. 5.1 Proof of Lemma 15 In order to upper bound the sample error, it is useful to work with a set of functions as small as possible, in a meaning made rigorous below. Although we study algorithms that work on the whole RKHSH σ a priori, let us first show that we can drastically downsize it. Indeed, recall that the marginal distribution of P in X is assumed to have a support included in a compactx R d. The restriction of k σ tox, denoted by k X σ, is a positive definite kernel onx Aronszajn, 1950 with RKHS defined by: where f X denotes the restriction of f tox,and RKHS norm: For any f X H X σ H X σ := { f X : f H σ }, 26 f X H X σ, f X H X σ := inf { f H σ : f H σ and f X = f X }. 27 consider the following risks: R X φ,σ f X := E P X [ φ Y f X X ] + λ f X 2 H X σ, R X φ,σ f X := 1 n n i=1 φ Y i f X X i + λ f X 2 H X σ. We first show that the sample error is the same inh σ andh X σ : 831
Controllability and Observability of Partial Differential Equations: Some results and open problems
Controllability and Observability of Partial Differential Equations: Some results and open problems Enrique ZUAZUA Departamento de Matemáticas Universidad Autónoma 2849 Madrid. Spain. enrique.zuazua@uam.es
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to
More informationHow to Use Expert Advice
NICOLÒ CESABIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationWhen Is There a Representer Theorem? Vector Versus Matrix Regularizers
Journal of Machine Learning Research 10 (2009) 25072529 Submitted 9/08; Revised 3/09; Published 11/09 When Is There a Representer Theorem? Vector Versus Matrix Regularizers Andreas Argyriou Department
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationSubspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity
Subspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity Wei Dai and Olgica Milenkovic Department of Electrical and Computer Engineering University of Illinois at UrbanaChampaign
More informationOptimization with SparsityInducing Penalties. Contents
Foundations and Trends R in Machine Learning Vol. 4, No. 1 (2011) 1 106 c 2012 F. Bach, R. Jenatton, J. Mairal and G. Obozinski DOI: 10.1561/2200000015 Optimization with SparsityInducing Penalties By
More informationAn Introduction to Variable and Feature Selection
Journal of Machine Learning Research 3 (23) 11571182 Submitted 11/2; Published 3/3 An Introduction to Variable and Feature Selection Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 9478151, USA
More informationProbability in High Dimension
Ramon van Handel Probability in High Dimension ORF 570 Lecture Notes Princeton University This version: June 30, 2014 Preface These lecture notes were written for the course ORF 570: Probability in High
More informationFoundations of Data Science 1
Foundations of Data Science John Hopcroft Ravindran Kannan Version /4/204 These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes
More informationGeneralized compact knapsacks, cyclic lattices, and efficient oneway functions
Generalized compact knapsacks, cyclic lattices, and efficient oneway functions Daniele Micciancio University of California, San Diego 9500 Gilman Drive La Jolla, CA 920930404, USA daniele@cs.ucsd.edu
More informationDecoding by Linear Programming
Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095
More informationClustering with Bregman Divergences
Journal of Machine Learning Research 6 (2005) 1705 1749 Submitted 10/03; Revised 2/05; Published 10/05 Clustering with Bregman Divergences Arindam Banerjee Srujana Merugu Department of Electrical and Computer
More informationBayesian Models of Graphs, Arrays and Other Exchangeable Random Structures
Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures Peter Orbanz and Daniel M. Roy Abstract. The natural habitat of most Bayesian methods is data represented by exchangeable sequences
More informationFrom Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images
SIAM REVIEW Vol. 51,No. 1,pp. 34 81 c 2009 Society for Industrial and Applied Mathematics From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images Alfred M. Bruckstein David
More informationCOSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES
COSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES D NEEDELL AND J A TROPP Abstract Compressive sampling offers a new paradigm for acquiring signals that are compressible with respect
More informationONEDIMENSIONAL RANDOM WALKS 1. SIMPLE RANDOM WALK
ONEDIMENSIONAL RANDOM WALKS 1. SIMPLE RANDOM WALK Definition 1. A random walk on the integers with step distribution F and initial state x is a sequence S n of random variables whose increments are independent,
More informationSome Sharp Performance Bounds for Least Squares Regression with L 1 Regularization
Some Sharp Performance Bounds for Least Squares Regression with L 1 Regularization Tong Zhang Statistics Department Rutgers University, NJ tzhang@stat.rutgers.edu Abstract We derive sharp performance bounds
More informationf (x) π (dx), n f(x k ). (1) k=1
To appear Annals of Applied Probability 2005Vol. 0, No. 0, 1 8 ON THE ERGODICITY PROPERTIES OF SOME ADAPTIVE MCMC ALGORITHMS By Christophe Andrieu, and Éric Moulines University of Bristol and École Nationale
More informationBeyond Correlation: Extreme Comovements Between Financial Assets
Beyond Correlation: Extreme Comovements Between Financial Assets Roy Mashal Columbia University Assaf Zeevi Columbia University This version: October 14, 2002 Abstract This paper investigates the potential
More informationLearning to Select Features using their Properties
Journal of Machine Learning Research 9 (2008) 23492376 Submitted 8/06; Revised 1/08; Published 10/08 Learning to Select Features using their Properties Eyal Krupka Amir Navot Naftali Tishby School of
More informationA Tutorial on Spectral Clustering
A Tutorial on Spectral Clustering Ulrike von Luxburg Max Planck Institute for Biological Cybernetics Spemannstr. 38, 7276 Tübingen, Germany ulrike.luxburg@tuebingen.mpg.de This article appears in Statistics
More informationON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT
ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros
More informationFourier Theoretic Probabilistic Inference over Permutations
Journal of Machine Learning Research 10 (2009) 9971070 Submitted 5/08; Revised 3/09; Published 5/09 Fourier Theoretic Probabilistic Inference over Permutations Jonathan Huang Robotics Institute Carnegie
More informationStructured Variable Selection with SparsityInducing Norms
Journal of Machine Learning Research 1 (011) 77784 Submitted 9/09; Revised 3/10; Published 10/11 Structured Variable Selection with SparsityInducing Norms Rodolphe Jenatton INRIA  SIERRA Projectteam,
More informationTwo faces of active learning
Two faces of active learning Sanjoy Dasgupta dasgupta@cs.ucsd.edu Abstract An active learner has a collection of data points, each with a label that is initially hidden but can be obtained at some cost.
More informationAn Elementary Introduction to Modern Convex Geometry
Flavors of Geometry MSRI Publications Volume 3, 997 An Elementary Introduction to Modern Convex Geometry KEITH BALL Contents Preface Lecture. Basic Notions 2 Lecture 2. Spherical Sections of the Cube 8
More informationA Kernel Method for the TwoSampleProblem
A Kernel Method for the TwoSampleProblem Arthur Gretton MPI for Biological Cybernetics Tübingen, Germany arthur@tuebingen.mpg.de Karsten M. Borgwardt LudwigMaximiliansUniv. Munich, Germany kb@dbs.ifi.lmu.de
More informationObject Detection with Discriminatively Trained Part Based Models
1 Object Detection with Discriminatively Trained Part Based Models Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan Abstract We describe an object detection system based on mixtures
More informationCompeting in the Dark: An Efficient Algorithm for Bandit Linear Optimization
Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization Jacob Abernethy Computer Science Division UC Berkeley jake@cs.berkeley.edu Elad Hazan IBM Almaden hazan@us.ibm.com Alexander
More information