Consistency and Convergence Rates of One-Class SVMs and Related Algorithms


Journal of Machine Learning Research. Submitted 9/05; Revised 1/06; Published 5/06.

Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Régis Vert
Laboratoire de Recherche en Informatique, Bâtiment 490, Université Paris-Sud, 91405 Orsay Cedex, France, and Masagroup, 24 Boulevard de l'Hôpital, Paris, France

Jean-Philippe Vert
Center for Computational Biology, Ecole des Mines de Paris, 35 rue Saint-Honoré, 77300 Fontainebleau, France

Editor: Bernhard Schölkopf

Abstract

We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L_2 sense are provided, together with upper bounds on the classification error, which is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization cannot vanish by construction, and which is shown for the first time to be a consistent density level set estimator.

Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

1. Introduction

Given n independent and identically distributed (i.i.d.) copies (X_1, Y_1), ..., (X_n, Y_n) of a random variable (X, Y) taking values in R^d × {−1, 1}, we study in this paper the limit and consistency of learning algorithms that solve the following problem:

\arg\min_{f \in \mathcal{H}_\sigma} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\big(Y_i f(X_i)\big) + \lambda \|f\|_{\mathcal{H}_\sigma}^2 \right\},    (1)

© 2006 Régis Vert and Jean-Philippe Vert.
where φ : R → R is a convex loss function and H_σ is the reproducing kernel Hilbert space (RKHS) of the normalized Gaussian radial basis function kernel (denoted simply Gaussian kernel below):

k_\sigma(x, x') := \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right), \qquad \sigma > 0.    (2)

This framework encompasses in particular the classical support vector machine (SVM) (Boser et al., 1992) when φ(u) = max(1 − u, 0) (Theorem 6). Recent years have witnessed important theoretical advances aimed at understanding the behavior of such regularized algorithms when n tends to infinity and λ decreases to 0. In particular, the consistency and convergence rates of the two-class SVM have been studied in detail (see, e.g., Steinwart, 2002; Zhang, 2004; Steinwart and Scovel, 2004, and references therein), as well as the shape of the asymptotic decision function (Steinwart, 2003; Bartlett and Tewari). The case of more general convex loss functions has also attracted a lot of attention recently (Zhang, 2004; Lugosi and Vayatis, 2004; Bartlett et al., 2006), and such losses have been shown, under general assumptions, to yield procedures consistent for the classification error. All results published so far, however, study the case where λ decreases as the number of points tends to infinity or, equivalently, where λσ^d converges to 0 if one uses the classical non-normalized version of the Gaussian kernel instead of (2). Although it seems natural to reduce regularization as more and more training data become available (more than natural, it is the spirit of regularization; Tikhonov and Arsenin, 1977; Silverman, 1982), there is at least one important situation where λ is typically held fixed: the one-class SVM (Schölkopf et al., 2001). In that case, the goal is to estimate an α-quantile, that is, a subset of R^d of given probability α with minimum volume. The estimation is performed by thresholding the function output by the one-class SVM, that is, the SVM (1) with only positive examples; in that case λ is supposed to determine the quantile level.
Although it is known that the fraction of examples in the selected region converges to the desired quantile level α (Schölkopf et al., 2001), it is still an open question whether the region converges towards a quantile, that is, a region of minimum volume. Besides, most theoretical results about the consistency and convergence rates of the two-class SVM with vanishing regularization constant do not translate to the one-class case, as we are precisely in the seldom-studied situation where the SVM is used with a regularization term that does not vanish as the sample size increases. The main contribution of this paper is to show that Bayes consistency for the classification error can be obtained for algorithms that solve (1) without decreasing λ, if instead the bandwidth σ of the Gaussian kernel decreases at a suitable rate. We prove upper bounds on the convergence rate of the classification error towards the Bayes risk for a variety of loss functions φ and distributions P, in particular for the SVM (Theorem 6). Moreover, we provide an explicit description of the function asymptotically output by the algorithms, and establish convergence rates towards this limit in the L_2 norm (Theorem 7). In particular, we show that the decision function output by the one-class SVM converges towards the density to be estimated, truncated at the level 2λ (Theorem 8); we finally show (Theorem 9) that this implies the consistency of the one-class SVM as a density level set estimator for the excess-mass functional (Hartigan, 1987).

This paper is organized as follows. In Section 2, we set the framework of this study and state the main results. The rest of the paper is devoted to the proofs and some extensions of these results. In Section 3, we provide a number of known and new properties of the Gaussian RKHS. Section 4

1. While the original formulation of the one-class SVM involves a parameter ν, there is asymptotically a one-to-one correspondence between λ and ν.
is devoted to the proof of the main theorem, which describes the speed of convergence of the regularized φ-risk of its empirical minimizer towards its minimum. This proof involves in particular a control of the sample error in this particular setting, which is dealt with in Section 5. Section 6 relates the minimization of the regularized φ-risk to more classical measures of performance, in particular the classification error and the L_2 distance to the limit. These results are discussed in more detail in Section 7 for the case of the 1- and 2-SVM. Finally, the proof of the consistency of the one-class SVM as a density level set estimator is postponed to Section 8.

2. Notation and Main Results

Let (X, Y) be a pair of random variables taking values in R^d × {−1, 1}, with distribution P. We assume throughout this paper that the marginal distribution of X has a density ρ : R^d → R with respect to the Lebesgue measure, and that its support is included in a compact set X ⊂ R^d. Let η : R^d → [0, 1] denote a measurable version of the conditional distribution of Y = 1 given X. The function 2η − 1 then corresponds to the so-called regression function. The normalized Gaussian radial basis function (RBF) kernel k_σ with bandwidth parameter σ > 0 is defined for any (x, x') ∈ R^d × R^d by

k_\sigma(x, x') := \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right),

and the corresponding reproducing kernel Hilbert space (RKHS) is denoted by H_σ, with associated norm ‖·‖_{H_σ}. Moreover, let

\kappa_\sigma := \|k_\sigma\|_{L_\infty}^{1/2} = \left(\sqrt{2\pi}\,\sigma\right)^{-d/2}.    (3)

Several useful properties of this kernel and its RKHS are gathered in Section 3. Denoting by M the set of measurable real-valued functions on R^d, we define several risks for functions f ∈ M. The classification error rate, usually referred to as the risk of f, is

R(f) := P\big(\mathrm{sign}(f(X)) \neq Y\big),

and the minimum achievable classification error rate over M is called the Bayes risk:

R^* := \inf_{f \in \mathcal{M}} R(f).

For a scalar λ > 0, fixed throughout this paper, and a convex function φ : R → R, the φ-risk regularized by the RKHS norm is defined, for any σ > 0 and f ∈ H_σ, by

R_{\phi,\sigma}(f) := E_P\big[\phi(Y f(X))\big] + \lambda \|f\|_{\mathcal{H}_\sigma}^2,

and the minimum achievable R_{φ,σ}-risk over H_σ is denoted by

R^*_{\phi,\sigma} := \inf_{f \in \mathcal{H}_\sigma} R_{\phi,\sigma}(f).

2. We refer the reader to Section 3.2 for a brief discussion of the relation between the normalized and unnormalized Gaussian kernels.
Furthermore, for any real r ≥ 0, φ is Lipschitz on [−r, r], and we denote by L(r) the Lipschitz constant of the restriction of φ to the interval [−r, r]. For example, for the hinge loss φ(u) = max(0, 1 − u) one can take L(r) = 1, and for the squared hinge loss φ(u) = max(0, 1 − u)² one can take L(r) = 2(r + 1). Finally, the L_2-norm-regularized φ-risk is, for any f ∈ M,

R_{\phi,0}(f) := E_P\big[\phi(Y f(X))\big] + \lambda \|f\|_{L_2}^2,

where

\|f\|_{L_2}^2 := \int_{\mathbb{R}^d} f(x)^2 \, dx \in [0, +\infty],

and the minimum achievable R_{φ,0}-risk over M is denoted by

R^*_{\phi,0} := \inf_{f \in \mathcal{M}} R_{\phi,0}(f) < +\infty.

As we shall see in the sequel, this notation is consistent with the fact that R_{φ,0} is the pointwise limit of R_{φ,σ} as σ tends to zero. Each of these risks has an empirical counterpart in which the expectation with respect to P is replaced by an average over an i.i.d. sample T := {(X_1, Y_1), ..., (X_n, Y_n)}. In particular, the following empirical version of R_{φ,σ} will be used: for σ > 0 and f ∈ H_σ,

\hat{R}_{\phi,\sigma}(f) := \frac{1}{n} \sum_{i=1}^{n} \phi\big(Y_i f(X_i)\big) + \lambda \|f\|_{\mathcal{H}_\sigma}^2.

Furthermore, f̂_{φ,σ} denotes the minimizer of R̂_{φ,σ} over H_σ (see Steinwart, 2005a, for a proof of existence and uniqueness of f̂_{φ,σ}). The main focus of this paper is the analysis of learning algorithms that minimize the empirical φ-risk regularized by the RKHS norm, R̂_{φ,σ}, and their limit as the number of points tends to infinity and the kernel width σ decreases to 0 at a suitable rate (λ being kept fixed). Roughly speaking, our main result shows that in this situation, the minimization of R̂_{φ,σ} asymptotically amounts to minimizing R_{φ,0}. This stems from the fact that the empirical average term in the definition of R̂_{φ,σ} converges to its expectation, while the H_σ norm of a fixed function decreases to its L_2 norm as σ decreases to zero. To turn this intuition into a rigorous statement, we need a few more assumptions about the minimizer of R_{φ,0} and about P.
First, we observe that the minimizer of R_{φ,0} is indeed well-defined and can often be explicitly computed (the following lemma is part of Theorem 26 and is proved in Section 6.3):

Lemma 1 (Minimizer of R_{φ,0}) For any x ∈ R^d, let

f_{\phi,0}(x) := \arg\min_{\alpha \in \mathbb{R}} \left\{ \rho(x)\big[\eta(x)\phi(\alpha) + (1 - \eta(x))\phi(-\alpha)\big] + \lambda \alpha^2 \right\}.

Then f_{φ,0} is measurable and satisfies:

R_{\phi,0}(f_{\phi,0}) = \inf_{f \in \mathcal{M}} R_{\phi,0}(f).
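For intuition, the pointwise problem of Lemma 1 is one-dimensional and can be solved by brute force. The sketch below (the helper names are ours) performs a grid search; in the one-class setting (η ≡ 1) with the hinge loss it recovers the truncation min(ρ(x)/(2λ), 1), which is exactly the limit function appearing later in Theorem 8.

```python
import numpy as np

def pointwise_minimizer(rho_x, eta_x, phi, lam):
    # Grid search for f_{phi,0}(x) = argmin_a { rho(x)[eta(x) phi(a) + (1 - eta(x)) phi(-a)] + lam * a^2 }
    grid = np.linspace(-3.0, 3.0, 120001)
    vals = rho_x * (eta_x * phi(grid) + (1.0 - eta_x) * phi(-grid)) + lam * grid ** 2
    return float(grid[np.argmin(vals)])

def hinge(a):
    return np.maximum(0.0, 1.0 - a)
```

With η ≡ 1 the objective is ρ(x)(1 − α)_+ + λα², whose minimizer is ρ(x)/(2λ) when ρ(x) ≤ 2λ and 1 otherwise; with η ≡ 1/2 the two hinge terms cancel each other's slope near 0 and the quadratic penalty forces the minimizer to 0.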
Second, let us recall the notion of modulus of continuity (DeVore and Lorentz, 1993, p. 44):

Definition 2 (Modulus of Continuity) Let f be a Lebesgue measurable function from R^d to R. Its modulus of continuity in the L_1 norm is defined for any δ ≥ 0 by

\omega(f, \delta) := \sup_{0 \le \|t\| \le \delta} \big\| f(\cdot + t) - f(\cdot) \big\|_{L_1},    (4)

where ‖t‖ is the Euclidean norm of t ∈ R^d.

The main result of this paper, whose proof is postponed to Section 4, can now be stated as follows:

Theorem 3 (Main Result) Let σ_1 > σ > 0, 0 < p < 2, δ > 0, and let f̂_{φ,σ} denote the minimizer of the R̂_{φ,σ} risk over H_σ, where φ is assumed to be convex. Assume that the marginal density ρ is bounded, and let M := sup_{x ∈ R^d} ρ(x). Then there exist constants (K_i)_{i=1,...,4}, depending only on p, δ, λ, d, and M, such that the following holds with probability greater than 1 − e^{−x} over the draw of the training data:

R_{\phi,0}(\hat f_{\phi,\sigma}) - R^*_{\phi,0} \le K_1 \left( L\!\left(\kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\right) \kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}} \right)^{\frac{4}{2+p}} \frac{1}{\sigma^{\frac{[2+(2-p)(1+\delta)]d}{2+p}}\, n^{\frac{2}{2+p}}} + K_2\, L\!\left(\kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\right) \kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\; \frac{x}{n} + K_3\, \frac{\sigma^2}{\sigma_1^2} + K_4\, \omega(f_{\phi,0}, \sigma_1),    (5)

where L(r) still denotes the Lipschitz constant of φ on the interval [−r, r], for any r > 0.

The first two terms on the r.h.s. of (5) bound the estimation error (also called sample error) associated with the Gaussian RKHS, which naturally tends to be small when the number of training data increases and when the RKHS is small, i.e., when σ is large. As is usually the case in such variance/bias splittings, the variance term here depends on the dimension d of the input space. Note that it is also parametrized by both p and δ. These two parameters come from the bound (36) on covering numbers that is used to derive the estimation error bound (31). Both constants K_1 and K_2 depend on them, although we do not know the explicit dependency. The third term measures the error due to penalizing the L_2 norm of a fixed function in H_{σ_1} by its H_σ norm, with 0 < σ < σ_1. This is a price to pay to get a small estimation error.
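The modulus of continuity ω(f, σ_1) appearing in (5) is easy to approximate numerically. A minimal 1-D sketch (the helper name is ours) estimates ω(f, δ) for the indicator function of [0, 1], for which the exact value is 2δ whenever δ ≤ 1, an instance of the ω(f, δ) ≤ c δ^β behaviour with β = 1 invoked below:

```python
import numpy as np

def modulus_of_continuity(f_vals, dx, delta):
    # Estimate omega(f, delta) = sup_{|t| <= delta} ||f(. + t) - f||_{L1}
    # for a function sampled on a uniform 1-D grid with spacing dx.
    best = 0.0
    for shift in range(1, int(round(delta / dx)) + 1):
        diff = np.abs(np.roll(f_vals, -shift) - f_vals)   # f(x + shift*dx) - f(x)
        best = max(best, float(np.sum(diff) * dx))
    return best

dx = 0.001
x = np.arange(-2.0, 3.0, dx)
f = ((x >= 0.0) & (x <= 1.0)).astype(float)   # indicator of [0, 1]: omega(f, delta) = 2*delta for delta <= 1
```

The wrap-around of np.roll is harmless here because the function vanishes near both grid ends.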
As for the fourth term, it is a bound on the approximation error of the Gaussian RKHS. Note that, once λ and σ have been fixed, σ_1 remains a free variable parameterizing the bound itself. From (5), we can deduce the R_{φ,0}-consistency of f̂_{φ,σ} for Lipschitz loss functions as soon as f_{φ,0} is integrable, σ tends to 0 slowly enough that σ n^{1/(d+ε)} → +∞ for some ε > 0, and σ_1 → 0 with σ/σ_1 → 0. Now, in order to highlight the type of convergence rates one can obtain from Theorem 3, let us assume that the loss function φ is Lipschitz on R (e.g., take the hinge loss), and suppose that for some 0 ≤ β ≤ 1, c_1 > 0,
and for any h ≥ 0, f_{φ,0} satisfies the following inequality:

\omega(f_{\phi,0}, \delta) \le c_1 \delta^\beta.    (6)

Then the right-hand side of (5) can be optimized with respect to σ_1, σ, p and δ by balancing the first, third and fourth terms (the second term always has a better convergence rate than the first one). For any ε > 0, choosing δ and p appropriately (as functions of β, d and ε), together with

\sigma = \left(\frac{1}{n}\right)^{\frac{2+\beta}{4\beta+(2+\beta)d+\varepsilon}}, \qquad \sigma_1 = \sigma^{\frac{2}{2+\beta}} = \left(\frac{1}{n}\right)^{\frac{2}{4\beta+(2+\beta)d+\varepsilon}},

the following rate of convergence is obtained:

R_{\phi,0}(\hat f_{\phi,\sigma}) - R^*_{\phi,0} = O_P\!\left( \left(\frac{1}{n}\right)^{\frac{2\beta}{4\beta+(2+\beta)d+\varepsilon}} \right).

This shows in particular that, whatever the values of β and d, the convergence rate that can be derived from Theorem 3 is always slower than 1/√n, and it gets slower and slower as the dimension d increases. Theorem 3 shows that, when φ is convex, minimizing the R̂_{φ,σ} risk for a well-chosen width σ gives an algorithm consistent for the R_{φ,0} risk. In order to relate this consistency to more traditional measures of performance of learning algorithms, the next theorem shows that, under a simple additional condition on φ, R_{φ,0}-risk consistency implies Bayes consistency:

Theorem 4 (Relating R_{φ,0}-Consistency to Bayes Consistency) If φ is convex, differentiable at 0, with φ'(0) < 0, then for every sequence of functions (f_i)_{i≥1} in M,

\lim_{i \to +\infty} R_{\phi,0}(f_i) = R^*_{\phi,0} \implies \lim_{i \to +\infty} R(f_i) = R^*.

This theorem results from a more general quantitative analysis of the relationship between the excess R_{φ,0}-risk and the excess R-risk (Theorem 28), and is proved in Section 6.5. In order to state a refined version of it in the particular case of the support vector machine algorithm, we first need to introduce the notion of low density exponent:

Definition 5 We say that a distribution P with marginal density ρ (with respect to the Lebesgue measure) has low density exponent γ ≥ 0 if there exist (c_2, ε_0) ∈ (0, +∞)² such that

\forall \varepsilon \in [0, \varepsilon_0], \quad P\left(\left\{ x \in \mathbb{R}^d : \rho(x) \le \varepsilon \right\}\right) \le c_2\, \varepsilon^\gamma.

3. For instance, it can be shown that the indicator function of the unit ball in R^d, albeit not continuous, satisfies (6) with β = 1.
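Definition 5 can be checked on a toy density. For the triangular density ρ(x) = 2x on [0, 1] one has P(ρ(X) ≤ ε) = P(X ≤ ε/2) = ε²/4, so γ = 2 with c_2 = 1/4. A small Monte Carlo sketch (the sampler uses the inverse-CDF method; the helper name is ours):

```python
import numpy as np

def low_density_prob(eps, n=1_000_000, seed=0):
    # Monte Carlo estimate of P(rho(X) <= eps) for the triangular density rho(x) = 2x on [0, 1].
    # Analytically, P(rho(X) <= eps) = P(X <= eps/2) = eps^2 / 4, i.e. gamma = 2 and c_2 = 1/4.
    rng = np.random.default_rng(seed)
    X = np.sqrt(rng.random(n))   # inverse-CDF sampling from F(x) = x^2
    return float(np.mean(2.0 * X <= eps))
```

Any density bounded away from 0 on its support trivially has every exponent γ, since P(ρ(X) ≤ ε) = 0 for small ε.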
We are now in a position to state a quantitative relationship between the excess R_{φ,0}-risk and the excess R-risk in the case of support vector machines:

Theorem 6 (Consistency of SVM) Let φ_1(α) := max(1 − α, 0) be the hinge loss function and let φ_2(α) := max(1 − α, 0)² be the squared hinge loss function. Then for any distribution P with low density exponent γ, there exist constants (K_1, K_2, r_1, r_2) ∈ (0, +∞)^4 such that for any f ∈ M with an excess R_{φ_1,0}-risk upper bounded by r_1, the following holds:

R(f) - R^* \le K_1 \left( R_{\phi_1,0}(f) - R^*_{\phi_1,0} \right)^{\frac{\gamma}{2\gamma+1}},

and if the excess R_{φ_2,0}-risk is upper bounded by r_2, the following holds:

R(f) - R^* \le K_2 \left( R_{\phi_2,0}(f) - R^*_{\phi_2,0} \right)^{\frac{\gamma}{2\gamma+1}}.

This theorem is proved in Section 7.3. In combination with Theorem 3, it states the consistency of the SVM, and gives upper bounds on the convergence rates, for the first time in a situation where the effect of regularization does not vanish asymptotically. In fact, Theorem 6 is a particular case of a more general result (Theorem 29) valid for a large class of convex loss functions. Section 6 is devoted to the analysis of the general case through the introduction of variational arguments, in the spirit of Bartlett et al. (2006). Another consequence of the R_{φ,0}-consistency of an algorithm is the L_2 convergence of the function output by the algorithm to the minimizer of the R_{φ,0} risk:

Lemma 7 (Relating R_{φ,0}-Consistency to L_2-Consistency) For any f ∈ M, the following holds:

\| f - f_{\phi,0} \|_{L_2}^2 \le \frac{1}{\lambda} \left( R_{\phi,0}(f) - R^*_{\phi,0} \right).

This result is the third statement of Theorem 26, proved in Section 6.3. It is particularly relevant for algorithms whose objective is not binary classification. Consider for example the one-class SVM algorithm, which served as the initial motivation for this paper. We can state the following result, proved in Section 8.1:

Theorem 8 (L_2-Consistency of One-Class SVM) Let ρ_λ denote the function obtained by truncating the density:

\rho_\lambda(x) := \begin{cases} \dfrac{\rho(x)}{2\lambda} & \text{if } \rho(x) \le 2\lambda, \\ 1 & \text{otherwise.} \end{cases}    (7)
Let f̂_σ denote the function output by the one-class SVM:

\hat f_\sigma := \arg\min_{f \in \mathcal{H}_\sigma} \left\{ \frac{1}{n} \sum_{i=1}^{n} \phi\big(f(X_i)\big) + \lambda \|f\|_{\mathcal{H}_\sigma}^2 \right\}.

Then, under the general conditions of Theorem 3, and assuming that lim_{h→0} ω(ρ_λ, h) = 0,

\lim_{n \to +\infty} \| \hat f_\sigma - \rho_\lambda \|_{L_2} = 0, \quad \text{in probability},

for a well-calibrated bandwidth σ.
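The limit ρ_λ of Theorem 8 is a simple transform of the density and, for any µ < 2λ, thresholding it at µ/(2λ) recovers exactly the density level set {ρ ≥ µ}; this identity is what the level set estimator of the next theorem exploits. A small sketch (the helper name is ours) checks it on a grid for the standard normal density:

```python
import numpy as np

def rho_lam(rho_vals, lam):
    # Truncated density of equation (7): rho/(2*lam) where rho <= 2*lam, and 1 elsewhere.
    return np.where(rho_vals <= 2.0 * lam, rho_vals / (2.0 * lam), 1.0)

x = np.linspace(-4.0, 4.0, 2001)
rho = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal density, sup ~ 0.399
lam, mu = 0.1, 0.1                                    # requires 0 < mu < 2*lam < sup(rho)
C_mu = rho >= mu                                      # true level set {rho >= mu}
C_mu_plugin = rho_lam(rho, lam) >= mu / (2.0 * lam)   # level set of the truncated density
```

The two index sets coincide because 2λρ_λ = ρ wherever ρ ≤ 2λ, and ρ_λ = 1 ≥ µ/(2λ) wherever ρ > 2λ > µ.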
In this and the next theorem, "well-calibrated" refers to any choice of bandwidth σ that ensures R_{φ,0}-consistency, as discussed after Theorem 3. A very interesting by-product of this theorem is the consistency of the one-class SVM algorithm for density level set estimation, which to the best of our knowledge has not been stated so far (the proof is postponed to Section 8.2):

Theorem 9 (Consistency of One-Class SVM for Density Level Set Estimation) Let 0 < µ < 2λ < M, let C_µ be the level set of the density function ρ at level µ:

C_\mu := \left\{ x \in \mathbb{R}^d : \rho(x) \ge \mu \right\},

and let Ĉ_µ be the level set of 2λ f̂_σ at level µ:

\hat{C}_\mu := \left\{ x \in \mathbb{R}^d : \hat f_\sigma(x) \ge \frac{\mu}{2\lambda} \right\},

where f̂_σ is still the function output by the one-class SVM. For any distribution Q and any subset C of R^d, define the excess mass of C with respect to Q as follows:

H_Q(C) := Q(C) - \mu\, \mathrm{Leb}(C).

Then, under the general assumptions of Theorem 3, and assuming that lim_{h→0} ω(ρ_λ, h) = 0, we have

\lim_{n \to +\infty} H_P(C_\mu) - H_P(\hat{C}_\mu) = 0, \quad \text{in probability},

for a well-calibrated bandwidth σ.

The excess-mass functional was first introduced by Hartigan (1987) to assess the quality of density level set estimators. It is maximized by the true density level set C_µ and acts as a risk functional in the one-class framework. The proof of Theorem 9 is based on the following general result: if ρ̂ is a density estimator converging to the true density ρ in the L_2 sense, then for any fixed 0 < µ < sup_{R^d} ρ, the excess mass of the level set of ρ̂ at level µ converges to the excess mass of C_µ. In other words, as in the classification framework, plug-in rules built on L_2-consistent density estimators are consistent with respect to the excess mass.

3. Some Properties of the Gaussian Kernel and its RKHS

This section presents known and new results about the Gaussian kernel k_σ and its associated RKHS H_σ that are useful for the proofs of our results.
They concern the explicit description of the RKHS norm in terms of Fourier transforms, its relation to the L_2 norm, and some approximation properties of convolutions with the Gaussian kernel. They make use of basic properties of Fourier transforms, which we now recall and which can be found in, e.g., Folland (1992, Chap. 7, p. 205). For any f in L_1(R^d), its Fourier transform F[f] : R^d → C is defined by

\mathcal{F}[f](\omega) := \int_{\mathbb{R}^d} e^{-i\langle x, \omega \rangle} f(x)\, dx.

If in addition F[f] ∈ L_1(R^d), f can be recovered from F[f] by the inverse Fourier formula:

f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \mathcal{F}[f](\omega)\, e^{i\langle x, \omega \rangle}\, d\omega.
Finally, Parseval's equality relates the L_2 norm of a function to that of its Fourier transform: if f ∈ L_1(R^d) ∩ L_2(R^d) and F[f] ∈ L_2(R^d), then

\|f\|_{L_2}^2 = \frac{1}{(2\pi)^d} \|\mathcal{F}[f]\|_{L_2}^2.    (8)

3.1 Fourier Representation of the Gaussian RKHS

For any u ∈ R^d, the expression k_σ(u) denotes k_σ(0, u), whose Fourier transform is known to be

\mathcal{F}[k_\sigma](\omega) = e^{-\frac{\sigma^2 \|\omega\|^2}{2}}.    (9)

The general study of translation-invariant kernels provides an accurate characterization of their associated RKHS in terms of their Fourier transforms (see, e.g., Matache and Matache). In the case of the Gaussian kernel, the following holds:

Lemma 10 (Characterization of H_σ) Let C_0(R^d) denote the set of continuous functions on R^d that vanish at infinity. The set

\mathcal{H}_\sigma := \left\{ f \in C_0(\mathbb{R}^d) : f \in L_1(\mathbb{R}^d) \text{ and } \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2\, e^{\frac{\sigma^2 \|\omega\|^2}{2}}\, d\omega < +\infty \right\}    (10)

is the RKHS associated with the Gaussian kernel k_σ, and the associated dot product is given for any f, g ∈ H_σ by

\langle f, g \rangle_{\mathcal{H}_\sigma} = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \mathcal{F}[f](\omega)\, \overline{\mathcal{F}[g](\omega)}\; e^{\frac{\sigma^2 \|\omega\|^2}{2}}\, d\omega,    (11)

where ā denotes the conjugate of a complex number a. In particular, the associated norm is given for any f ∈ H_σ by

\|f\|_{\mathcal{H}_\sigma}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2\, e^{\frac{\sigma^2 \|\omega\|^2}{2}}\, d\omega.    (12)

This lemma readily implies several basic facts about Gaussian RKHSs and their associated norms, summarized in the next lemma. In particular, it shows that the family (H_σ)_{σ>0} forms a nested collection of models, and that for any fixed function, the RKHS norm decreases to the L_2 norm as the kernel bandwidth decreases to 0.

Lemma 11 The following statements hold:

1. For any 0 < σ < τ,

\mathcal{H}_\tau \subset \mathcal{H}_\sigma \subset L_2(\mathbb{R}^d).    (13)

Moreover, for any f ∈ H_τ,

\|f\|_{L_2} \le \|f\|_{\mathcal{H}_\sigma} \le \|f\|_{\mathcal{H}_\tau}.    (14)

2. For any τ > 0 and f ∈ H_τ,

\|f\|_{\mathcal{H}_\sigma}^2 - \|f\|_{L_2}^2 \le \frac{\sigma^2}{\tau^2} \left( \|f\|_{\mathcal{H}_\tau}^2 - \|f\|_{L_2}^2 \right),    (15)

and

\lim_{\sigma \to 0} \|f\|_{\mathcal{H}_\sigma} = \|f\|_{L_2}.    (16)
3. For any σ > 0 and f ∈ H_σ,

\|f\|_{L_\infty} \le \kappa_\sigma \|f\|_{\mathcal{H}_\sigma}.    (17)

Proof. Equations (13) and (14) are direct consequences of the characterization (12) of the Gaussian RKHS and of the observation that

0 < \sigma < \tau \implies e^{\frac{\sigma^2 \|\omega\|^2}{2}} \le e^{\frac{\tau^2 \|\omega\|^2}{2}}.

In order to prove (15), we derive from (12) and Parseval's equality (8):

\|f\|_{\mathcal{H}_\sigma}^2 - \|f\|_{L_2}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2 \left( e^{\frac{\sigma^2 \|\omega\|^2}{2}} - 1 \right) d\omega.    (18)

For any 0 ≤ u ≤ v, we have (e^u − 1)/u ≤ (e^v − 1)/v by convexity of e^u, and therefore

\|f\|_{\mathcal{H}_\sigma}^2 - \|f\|_{L_2}^2 \le \frac{\sigma^2}{\tau^2} \cdot \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2 \left( e^{\frac{\tau^2 \|\omega\|^2}{2}} - 1 \right) d\omega,    (19)

which leads to (15). Equation (16) is now a direct consequence of (15). Finally, (17) is a classical bound derived from the observation that, for any x ∈ R^d,

|f(x)| = |\langle f, k_\sigma(x, \cdot) \rangle_{\mathcal{H}_\sigma}| \le \|f\|_{\mathcal{H}_\sigma} \|k_\sigma(x, \cdot)\|_{\mathcal{H}_\sigma} = \kappa_\sigma \|f\|_{\mathcal{H}_\sigma}.

3.2 Links with the Non-Normalized Gaussian Kernel

It is common in the machine learning literature to work with a non-normalized version of the Gaussian RBF kernel, namely the kernel

\tilde{k}_\sigma(x, x') := \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right).    (20)

From the relation k_σ = κ_σ² k̃_σ (recall that κ_σ is defined in Equation (3)), we deduce from the general theory of RKHSs that the two kernels share the same RKHS as a set, H̃_σ = H_σ, and

\forall f \in \mathcal{H}_\sigma, \quad \|f\|_{\tilde{\mathcal{H}}_\sigma} = \kappa_\sigma \|f\|_{\mathcal{H}_\sigma}.    (21)

As a result, all statements about k_σ and its RKHS easily translate into statements about k̃_σ and its RKHS. For example, (14) shows that, for any 0 < σ < τ and f ∈ H̃_τ,

\|f\|_{\tilde{\mathcal{H}}_\sigma} \le \frac{\kappa_\sigma}{\kappa_\tau} \|f\|_{\tilde{\mathcal{H}}_\tau} = \left(\frac{\tau}{\sigma}\right)^{d/2} \|f\|_{\tilde{\mathcal{H}}_\tau},

a result that was shown recently (Steinwart et al., 2004).
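The norm identities above can be verified numerically in d = 1 for f = k_τ, whose Fourier transform is given by (9). The sketch below (grid ranges are ad hoc choices of ours) evaluates (12) by a Riemann sum and checks that ‖f‖_{H_σ} decreases to ‖f‖_{L_2} as σ → 0, as stated in (14) and (16):

```python
import numpy as np

# f = k_tau in d = 1, whose Fourier transform is e^{-tau^2 w^2 / 2} by eq. (9).
tau = 1.0
dw = 3e-4
w = np.arange(-40.0, 40.0, dw)
F_abs2 = np.exp(-tau ** 2 * w ** 2)               # |F[f](w)|^2

def rkhs_norm_sq(sigma):
    # ||f||_{H_sigma}^2 from eq. (12), computed by a Riemann sum (valid for sigma^2 < 2 tau^2)
    return float(np.sum(F_abs2 * np.exp(sigma ** 2 * w ** 2 / 2.0)) * dw / (2.0 * np.pi))

l2_sq = float(np.sum(F_abs2) * dw / (2.0 * np.pi))   # Parseval, eq. (8)
```

For this particular f the integral is Gaussian and has the closed form ‖f‖²_{H_σ} = 1 / (2√(π(τ² − σ²/2))), which the quadrature reproduces to high accuracy.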
3.3 Convolution with the Gaussian Kernel

Besides its positive definiteness, the Gaussian kernel is commonly used for function approximation through convolution. Recall (Folland, 1992) that the convolution of two functions f, g ∈ L_1(R^d) is the function f * g ∈ L_1(R^d) defined by

(f * g)(x) := \int_{\mathbb{R}^d} f(x - u)\, g(u)\, du,

and that it satisfies (see, e.g., Folland, 1992, Chap. 7, p. 207)

\mathcal{F}[f * g] = \mathcal{F}[f]\, \mathcal{F}[g].    (22)

Convolution with a Gaussian RBF kernel is a technically convenient tool to map any square-integrable function to a Gaussian RKHS. The following lemma (which can also be found in Steinwart et al., 2004) gives several interesting properties of the RKHS and L_2 norms of functions smoothed by convolution:

Lemma 12 For any σ > 0 and any f ∈ L_1(R^d) ∩ L_2(R^d),

k_\sigma * f \in \mathcal{H}_{\sqrt{2}\sigma} \quad \text{and} \quad \|k_\sigma * f\|_{\mathcal{H}_{\sqrt{2}\sigma}} = \|f\|_{L_2}.    (23)

For any σ, τ > 0 that satisfy 0 < σ ≤ √2 τ, and for any f ∈ L_1(R^d) ∩ L_2(R^d),

k_\tau * f \in \mathcal{H}_\sigma \quad \text{and} \quad \|k_\tau * f\|_{\mathcal{H}_\sigma}^2 - \|k_\tau * f\|_{L_2}^2 \le \frac{\sigma^2}{2\tau^2} \|f\|_{L_2}^2.    (24)

Proof. Using (12), then (22) and (9), followed by Parseval's equality (8), we compute:

\|k_\sigma * f\|_{\mathcal{H}_{\sqrt{2}\sigma}}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[k_\sigma * f](\omega)|^2\, e^{\sigma^2 \|\omega\|^2}\, d\omega = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2\, e^{-\sigma^2 \|\omega\|^2}\, e^{\sigma^2 \|\omega\|^2}\, d\omega = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[f](\omega)|^2\, d\omega = \|f\|_{L_2}^2.

This proves the first two statements of the lemma. Now, because 0 < σ ≤ √2 τ, the first statement and (13) imply

k_\tau * f \in \mathcal{H}_{\sqrt{2}\tau} \subset \mathcal{H}_\sigma,
and, using (15) and (23),

\|k_\tau * f\|_{\mathcal{H}_\sigma}^2 - \|k_\tau * f\|_{L_2}^2 \le \frac{\sigma^2}{2\tau^2} \left( \|k_\tau * f\|_{\mathcal{H}_{\sqrt{2}\tau}}^2 - \|k_\tau * f\|_{L_2}^2 \right) \le \frac{\sigma^2}{2\tau^2} \|k_\tau * f\|_{\mathcal{H}_{\sqrt{2}\tau}}^2 = \frac{\sigma^2}{2\tau^2} \|f\|_{L_2}^2.

A final result we need is an estimate of the approximation properties of convolution with the Gaussian kernel. Convolving a function with a Gaussian kernel of decreasing bandwidth is known to provide an approximation of the original function under general conditions. For example, the assumption f ∈ L_1(R^d) is sufficient to show that ‖k_σ * f − f‖_{L_1} goes to zero as σ goes to zero (see, for example, Devroye and Lugosi, 2000, p. 79). We provide below a more quantitative estimate of the rate of this convergence under an assumption on the modulus of continuity of f (see Definition 2), using methods from DeVore and Lorentz (1993, Chap. 7, par. 2, p. 202).

Lemma 13 Let f be a bounded function in L_1(R^d). Then for all σ > 0, the following holds:

\|k_\sigma * f - f\|_{L_1} \le \left(1 + \sqrt{d}\right) \omega(f, \sigma),

where ω(f, ·) denotes the modulus of continuity of f in the L_1 norm.

Proof. Using the fact that k_σ is normalized, then Fubini's theorem, and then the definition of ω, we derive

\|k_\sigma * f - f\|_{L_1} = \int_{\mathbb{R}^d} \left| \int_{\mathbb{R}^d} k_\sigma(t)\big[f(x + t) - f(x)\big]\, dt \right| dx \le \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} k_\sigma(t)\, |f(x + t) - f(x)|\, dt\, dx = \int_{\mathbb{R}^d} k_\sigma(t) \left[ \int_{\mathbb{R}^d} |f(x + t) - f(x)|\, dx \right] dt \le \int_{\mathbb{R}^d} k_\sigma(t)\, \|f(\cdot + t) - f(\cdot)\|_{L_1}\, dt \le \int_{\mathbb{R}^d} k_\sigma(t)\, \omega(f, \|t\|)\, dt.

Now, using the following subadditivity property of ω (DeVore and Lorentz, 1993, Chap. 2, par. 7, p. 44):

\omega(f, \delta_1 + \delta_2) \le \omega(f, \delta_1) + \omega(f, \delta_2), \qquad \delta_1, \delta_2 > 0,

the following inequality can be derived for any positive λ and δ:

\omega(f, \lambda\delta) \le (1 + \lambda)\, \omega(f, \delta).
Applying this together with the Cauchy-Schwarz inequality leads to

\|k_\sigma * f - f\|_{L_1} \le \int_{\mathbb{R}^d} \left(1 + \frac{\|t\|}{\sigma}\right) \omega(f, \sigma)\, k_\sigma(t)\, dt = \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \int_{\mathbb{R}^d} \|t\|\, k_\sigma(t)\, dt \right] \le \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \left( \int_{\mathbb{R}^d} \|t\|^2\, k_\sigma(t)\, dt \right)^{1/2} \right] = \omega(f, \sigma) \left[ 1 + \frac{1}{\sigma} \left( \sum_{i=1}^{d} \int_{\mathbb{R}^d} t_i^2\, k_\sigma(t)\, dt \right)^{1/2} \right].

Each integral in the last sum is exactly the variance of a one-dimensional Gaussian random variable, namely σ². Hence we end up with

\|k_\sigma * f - f\|_{L_1} \le \left(1 + \sqrt{d}\right) \omega(f, \sigma).

4. Proof of Theorem 3

The proof of Theorem 3 is based on the following decomposition of the excess R_{φ,0}-risk of the minimizer of the R̂_{φ,σ} risk:

Lemma 14 For any 0 < σ < √2 σ_1 and any sample (X_i, Y_i)_{i=1,...,n}, the minimizer f̂_{φ,σ} of R̂_{φ,σ} satisfies:

R_{\phi,0}(\hat f_{\phi,\sigma}) - R^*_{\phi,0} \le \left[ R_{\phi,\sigma}(\hat f_{\phi,\sigma}) - R^*_{\phi,\sigma} \right] + \left[ R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \right] + \left[ R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R^*_{\phi,0} \right].    (25)

Proof. The excess R_{φ,0}-risk decomposes as follows:

R_{\phi,0}(\hat f_{\phi,\sigma}) - R^*_{\phi,0} = \left[ R_{\phi,0}(\hat f_{\phi,\sigma}) - R_{\phi,\sigma}(\hat f_{\phi,\sigma}) \right] + \left[ R_{\phi,\sigma}(\hat f_{\phi,\sigma}) - R^*_{\phi,\sigma} \right] + \left[ R^*_{\phi,\sigma} - R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) \right] + \left[ R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \right] + \left[ R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) - R^*_{\phi,0} \right].
Note that by Lemma 12, k_{σ_1} * f_{φ,0} ∈ H_{√2σ_1} ⊂ H_σ ⊂ L_2(R^d), which justifies the introduction of R_{φ,σ}(k_{σ_1} * f_{φ,0}) and R_{φ,0}(k_{σ_1} * f_{φ,0}). Now, by definition of the different risks, using (14) we have

R_{\phi,0}(\hat f_{\phi,\sigma}) - R_{\phi,\sigma}(\hat f_{\phi,\sigma}) = \lambda \left( \|\hat f_{\phi,\sigma}\|_{L_2}^2 - \|\hat f_{\phi,\sigma}\|_{\mathcal{H}_\sigma}^2 \right) \le 0,

and

R^*_{\phi,\sigma} - R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) \le 0.

Hence, controlling R_{φ,0}(f̂_{φ,σ}) − R*_{φ,0} in order to prove Theorem 3 boils down to controlling each of the three terms arising in (25), which can be done as follows.

The first term in (25) is usually referred to as the sample error or estimation error. The control of such quantities has been the topic of much research recently, including for example Tsybakov (1997); Mammen and Tsybakov (1999); Massart (2000); Bartlett et al. (2005); Koltchinskii (2003); Steinwart and Scovel (2004). Using estimates of local Rademacher complexities through covering numbers for the Gaussian RKHS due to Steinwart and Scovel (2004), we prove below the following result:

Lemma 15 For any σ > 0 small enough, let f̂_{φ,σ} be the minimizer of the R̂_{φ,σ} risk on a sample of size n, where φ is a convex loss function. For any 0 < p < 2, δ > 0, and x ≥ 1, the following holds with probability at least 1 − e^{−x} over the draw of the sample:

R_{\phi,\sigma}(\hat f_{\phi,\sigma}) - R^*_{\phi,\sigma} \le K_1 \left( L\!\left(\kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\right) \kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}} \right)^{\frac{4}{2+p}} \frac{1}{\sigma^{\frac{[2+(2-p)(1+\delta)]d}{2+p}}\, n^{\frac{2}{2+p}}} + K_2\, L\!\left(\kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\right) \kappa_\sigma\sqrt{\tfrac{\phi(0)}{\lambda}}\; \frac{x}{n},

where K_1 and K_2 are positive constants depending neither on σ nor on n.

The second term in (25) can be upper bounded by φ(0)σ²/(2σ_1²). Indeed, using Lemma 12 and the fact that σ < √2 σ_1, we have

R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) = \lambda \left( \|k_{\sigma_1} * f_{\phi,0}\|_{\mathcal{H}_\sigma}^2 - \|k_{\sigma_1} * f_{\phi,0}\|_{L_2}^2 \right) \le \frac{\lambda \sigma^2}{2\sigma_1^2} \|f_{\phi,0}\|_{L_2}^2.

Since f_{φ,0} minimizes R_{φ,0}, we have R_{φ,0}(f_{φ,0}) − R_{φ,0}(0) ≤ 0, which leads to ‖f_{φ,0}‖²_{L_2} ≤ φ(0)/λ. Therefore, we have

R_{\phi,\sigma}(k_{\sigma_1} * f_{\phi,0}) - R_{\phi,0}(k_{\sigma_1} * f_{\phi,0}) \le \frac{\phi(0)\, \sigma^2}{2\sigma_1^2}.
15 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS The third term in 25 can be upper bounded by Indeed, 2λ f φ,0 L + L f φ,0 L M 1+ d ω f φ,0,σ 1. R φ,0 k σ1 f φ,0 R φ,0 f φ,0 = λ [ k σ1 f φ,0 2 L 2 f φ,0 2 L 2 ] + [ EP [ φ Ykσ1 f φ,0 X ] E P [ φ Y fφ,0 X ]] = λ k σ1 f φ,0 f φ,0,k σ1 f φ,0 + f φ,0 L 2 +E P [ φ Ykσ1 f φ,0 X φ Y f φ,0 X ]. Now, since k σ1 f φ,0 L f φ,0 L k σ1 L1 = f φ,0 L, then using Lemma 13, we obtain: R φ,0 k σ1 f φ,0 R φ,0 f φ,0 where M := sup x R d ρx is supposed to be finite. 2λ f φ,0 L k σ1 f φ,0 f φ,0 L1 +L f φ,0 L EP [ kσ1 f φ,0 X f φ,0 X ] 2λ f φ,0 L + L f φ,0 L M kσ1 f φ,0 f φ,0 L1 2λ f φ,0 L + L f φ,0 L M 1+ d ω f φ,0,σ 1, Now, Theorem 3 is proved by plugging the last three bounds in Proof of Lemma 15 Sample Error The present section is divided into two subsections: the first one presents the proof of Lemma 15, and the auxiliary lemmas that are used in it are then proved in the second subsection. 5.1 Proof of Lemma 15 In order to upper bound the sample error, it is useful to work with a set of functions as small as possible, in a meaning made rigorous below. Although we study algorithms that work on the whole RKHSH σ a priori, let us first show that we can drastically downsize it. Indeed, recall that the marginal distribution of P in X is assumed to have a support included in a compactx R d. The restriction of k σ tox, denoted by k X σ, is a positive definite kernel onx Aronszajn, 1950 with RKHS defined by: where f X denotes the restriction of f tox,and RKHS norm: For any f X H X σ H X σ := { f X : f H σ }, 26 f X H X σ, f X H X σ := inf { f H σ : f H σ and f X = f X }. 27 consider the following risks: R X φ,σ f X := E P X [ φ Y f X X ] + λ f X 2 H X σ, R X φ,σ f X := 1 n n i=1 φ Y i f X X i + λ f X 2 H X σ. We first show that the sample error is the same inh σ andh X σ : 831
16 VERT AND VERT Lemma 16 Let f X φ,σ and ˆf X φ,σ be respectively the minimizers of RX φ,σ and R X φ,σ. Then it holds almost surely that From Lemma 16 we deduce that a.s., R φ,σ fφ,σ = R X φ,σ f X φ,σ, R φ,σ ˆf φ,σ = R X φ,σ ˆf X φ,σ. R φ,σ ˆf φ,σ Rφ,σ fφ,σ = R X φ,σ ˆf X φ,σ R X φ,σ f X φ,σ. 28 In order to upper bound this term, we use concentration inequalities based on local Rademacher complexities Bartlett et al., 2006, 2005; Steinwart and Scovel, In this approach, a crucial role is played by the covering number of a functional classf under the empirical L 2 norm. Remember that for a given sample T := {X 1,Y 1,...,X n,y n } and ε > 0, an εcover for the empirical L 2 norm is a family of function f i i I such that: f F, i I, 1 n 2 1 n f X j f i X j 2 ε. j=1 The covering numbern F,ε,L 2 T is then defined as the smallest cardinal of an εcover. We can now mention the following result, adapted to our notation and setting, that exactly fits our need. Theorem 17 see Steinwart and Scovel, 2004, Theorem 5.8. For σ > 0, letf σ be a convex subset ofh σ X and let φ be a convex loss function. DefineG σ as follows: { } G σ := g f x,y = φy fx+λ f 2 H σ X φy fx φ,σx λ f X φ,σ 2 H σ X : f F σ. 29 where f X φ,σ minimizes RX φ,σ overf σ. Suppose that there are constants c 0 and B > 0 such that, for all g G σ, E P [ g 2 ] ce P [g], and g L B. Furthermore, assume that there are constants a 1 and 0 < p < 2 with sup logn B 1 G σ,ε,l 2 T aε p 30 T Z n for all ε > 0. Then there exists a constant c p > 0 depending only on p such that for all n 1 and all x 1 we have Pr T Z n : R X φ,σ ˆf X φ,σ > R X φ,σ f X φ,σ+c p εn,a,b,c,x e x, 31 where εn,a,b,c, p,x := B+B 2p 2+p c 2 p 2+p a 2 2+p +B+c n x n. 832
17 CONSISTENCY AND CONVERGENCE RATES OF ONECLASS SVMS AND RELATED ALGORITHMS Note that we use the outer probability Pr in 31 because the argument is not necessarily measurable. From the inequalities f X φ,σ 2 H σ X φ0/λ and ˆf X φ,σ 2 H σ X φ0/λ, we see that it is enough to take φ0 F σ = λ BX σ, 32 whereb X σ is the unit ball ofh X σ, to derive a control of 28 from Theorem 17. In order to apply this theorem we now provide uniform upper bounds overg σ for the variance of g and its uniform norm, as well as an upper bound on the covering number ofg σ. Lemma 18 For all σ > 0, for all g G σ, [ E P g 2 ] κσ φ0 κσ L λφ0 λ λ E P[g]. 33 Let us fix κσ φ0 κσ φ0 B = 2L + φ0. 34 λ λ Then, the following two lemmas can be derived: Lemma 19 For all σ > 0, for all g G σ, g L B. 35 Lemma 20 For all σ > 0, 0 < p 2, δ > 0, ε > 0, the following holds: logn B 1 G σ,ε,l 2 T c 2 σ 1 p/21+δd ε p, 36 where c 1 and c 2 are constants that depend neither on σ, nor on ε but they depend on p, δ, d and λ. Combining now the results of Lemmas 18, 19 and 20 allows to apply Theorem 17 withf σ defined by 32, any p [0,2], and the following parameters: from which we deduce Lemma 15. κσ φ0 κσ c = L λφ0 λ λ, κσ φ0 κσ φ0 B = 2L + φ0, λ λ a = c 2 σ 1 p/21+δd, 833
5.2 Proofs of Auxiliary Lemmas

Proof of Lemma 16 Because the support of $P$ is included in $X$, the following equality trivially holds:
$$\forall f \in H_\sigma, \quad E_P[\phi(Y f(X))] = E_P\big[\phi\big(Y f|_X(X)\big)\big]. \quad (37)$$
Using first the definition of the restricted RKHS (26), then (37) and (27), we obtain
$$\begin{aligned} R^X_{\phi,\sigma}\big(f^X_{\phi,\sigma}\big) &= \inf_{f^X \in H^X_\sigma} \Big\{ E_P\big[\phi\big(Y f^X(X)\big)\big] + \lambda \|f^X\|^2_{H^X_\sigma} \Big\} \\ &= \inf_{f \in H_\sigma} \Big\{ E_P\big[\phi\big(Y f|_X(X)\big)\big] + \lambda \|f|_X\|^2_{H^X_\sigma} \Big\} \\ &= \inf_{f \in H_\sigma} \Big\{ E_P[\phi(Y f(X))] + \lambda \|f\|^2_{H_\sigma} \Big\} \\ &= R_{\phi,\sigma}(f_{\phi,\sigma}), \end{aligned} \quad (38)$$
which proves the first statement. In order to prove the second statement, let us first observe that with probability 1, $X_i \in X$ for $i = 1,\dots,n$, and therefore
$$\forall f \in H_\sigma, \quad \frac{1}{n} \sum_{i=1}^n \phi(Y_i f(X_i)) = \frac{1}{n} \sum_{i=1}^n \phi\big(Y_i f|_X(X_i)\big), \quad (39)$$
from which we can conclude, using the same line of proof as (38), that the following holds almost surely:
$$\hat R_{\phi,\sigma}(\hat f_{\phi,\sigma}) = \hat R^X_{\phi,\sigma}\big(\hat f^X_{\phi,\sigma}\big).$$
Let us now show that this implies the following equality:
$$\hat f^X_{\phi,\sigma} = \hat f_{\phi,\sigma}|_X. \quad (40)$$
Indeed, on the one hand, $\|\hat f_{\phi,\sigma}|_X\|_{H^X_\sigma} \le \|\hat f_{\phi,\sigma}\|_{H_\sigma}$ by (27). On the other hand, (39) implies that
$$\frac{1}{n} \sum_{i=1}^n \phi\big(Y_i \hat f_{\phi,\sigma}(X_i)\big) = \frac{1}{n} \sum_{i=1}^n \phi\big(Y_i \hat f_{\phi,\sigma}|_X(X_i)\big).$$
As a result, we get
$$\hat R^X_{\phi,\sigma}\big(\hat f_{\phi,\sigma}|_X\big) \le \hat R_{\phi,\sigma}(\hat f_{\phi,\sigma}) = \hat R^X_{\phi,\sigma}\big(\hat f^X_{\phi,\sigma}\big),$$
from which we deduce that $\hat f_{\phi,\sigma}|_X$ and $\hat f^X_{\phi,\sigma}$ both minimize the strictly convex functional $\hat R^X_{\phi,\sigma}$ on $H^X_\sigma$, proving (40). We also deduce from $\hat R^X_{\phi,\sigma}(\hat f_{\phi,\sigma}|_X) = \hat R_{\phi,\sigma}(\hat f_{\phi,\sigma})$ and from (39) that
$$\big\|\hat f_{\phi,\sigma}|_X\big\|_{H^X_\sigma} = \|\hat f_{\phi,\sigma}\|_{H_\sigma}. \quad (41)$$
Now, using (40), (37), then (41), we can conclude the proof of the second statement as follows:
$$\begin{aligned} R^X_{\phi,\sigma}\big(\hat f^X_{\phi,\sigma}\big) &= R^X_{\phi,\sigma}\big(\hat f_{\phi,\sigma}|_X\big) \\ &= E_P\big[\phi\big(Y \hat f_{\phi,\sigma}|_X(X)\big)\big] + \lambda \big\|\hat f_{\phi,\sigma}|_X\big\|^2_{H^X_\sigma} \\ &= E_P\big[\phi\big(Y \hat f_{\phi,\sigma}(X)\big)\big] + \lambda \|\hat f_{\phi,\sigma}\|^2_{H_\sigma} \\ &= R_{\phi,\sigma}(\hat f_{\phi,\sigma}), \end{aligned}$$
concluding the proof of Lemma 16.

Proof of Lemma 18 We prove the uniform upper bound on the variances of the excess-loss functions in terms of their expectation, using an approach similar to, but slightly simpler than, Bartlett et al. (2006, Lemma 14) and Steinwart and Scovel (2004, Proposition 6.1). First we observe, using (17) and the fact that $\mathcal{F}_\sigma \subset \sqrt{\phi(0)/\lambda}\, B_\sigma$, that for any $f \in \mathcal{F}_\sigma$,
$$\|f\|_\infty \le \kappa_\sigma \|f\|_{H_\sigma} \le \kappa_\sigma \sqrt{\frac{\phi(0)}{\lambda}}.$$
As a result, for any $(x,y) \in X \times \{-1,+1\}$,
$$\begin{aligned} g_f(x,y) &= \phi(y f(x)) - \phi\big(y f_{\phi,\sigma}(x)\big) + \lambda \big( \|f\|^2_{H_\sigma} - \|f_{\phi,\sigma}\|^2_{H_\sigma} \big) \\ &\le L \big| f(x) - f_{\phi,\sigma}(x) \big| + \lambda \|f - f_{\phi,\sigma}\|_{H_\sigma} \big( \|f\|_{H_\sigma} + \|f_{\phi,\sigma}\|_{H_\sigma} \big) \\ &\le \Big( L \kappa_\sigma + 2 \sqrt{\lambda \phi(0)} \Big) \|f - f_{\phi,\sigma}\|_{H_\sigma}, \end{aligned} \quad (42)$$
where $L$ is the Lipschitz constant of $\phi$ on $\big[-\kappa_\sigma \sqrt{\phi(0)/\lambda},\, \kappa_\sigma \sqrt{\phi(0)/\lambda}\big]$. Taking the square on both sides of this inequality and averaging with respect to $P$ leads to
$$\forall f \in \mathcal{F}_\sigma, \quad E_P\big[g_f^2\big] \le \Big( L \kappa_\sigma + 2 \sqrt{\lambda \phi(0)} \Big)^2 \|f - f_{\phi,\sigma}\|^2_{H_\sigma}. \quad (43)$$
On the other hand, we deduce from the convexity of $\phi$ and the parallelogram identity in $H_\sigma$ that for any $(x,y) \in X \times \{-1,+1\}$ and any $f \in \mathcal{F}_\sigma$,
$$\begin{aligned} \frac{ \phi(y f(x)) + \lambda \|f\|^2_{H_\sigma} + \phi\big(y f_{\phi,\sigma}(x)\big) + \lambda \|f_{\phi,\sigma}\|^2_{H_\sigma} }{2} &\ge \phi\Big( y\, \frac{f(x) + f_{\phi,\sigma}(x)}{2} \Big) + \lambda\, \frac{ \|f\|^2_{H_\sigma} + \|f_{\phi,\sigma}\|^2_{H_\sigma} }{2} \\ &= \phi\Big( y\, \frac{f + f_{\phi,\sigma}}{2}(x) \Big) + \lambda \Big\| \frac{f + f_{\phi,\sigma}}{2} \Big\|^2_{H_\sigma} + \lambda \Big\| \frac{f - f_{\phi,\sigma}}{2} \Big\|^2_{H_\sigma}. \end{aligned}$$
Averaging this inequality with respect to $P$ yields
$$\frac{ R_{\phi,\sigma}(f) + R_{\phi,\sigma}(f_{\phi,\sigma}) }{2} \ge R_{\phi,\sigma}\Big( \frac{f + f_{\phi,\sigma}}{2} \Big) + \lambda \Big\| \frac{f - f_{\phi,\sigma}}{2} \Big\|^2_{H_\sigma} \ge R_{\phi,\sigma}(f_{\phi,\sigma}) + \lambda \Big\| \frac{f - f_{\phi,\sigma}}{2} \Big\|^2_{H_\sigma},$$
where the second inequality is due to the definition of $f_{\phi,\sigma}$ as a minimizer of $R_{\phi,\sigma}$. Therefore we get, for any $f \in \mathcal{F}_\sigma$,
$$E_P[g_f] = R_{\phi,\sigma}(f) - R_{\phi,\sigma}(f_{\phi,\sigma}) \ge \frac{\lambda}{2} \|f - f_{\phi,\sigma}\|^2_{H_\sigma}. \quad (44)$$
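The key Hilbert-space step in the proof of Lemma 18 is the parallelogram identity $\frac{1}{2}\big(\|f\|^2 + \|g\|^2\big) = \big\|\frac{f+g}{2}\big\|^2 + \big\|\frac{f-g}{2}\big\|^2$. A quick finite-dimensional sanity check (the random vectors below are hypothetical stand-ins for RKHS elements):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=8)   # stand-in for an RKHS element
g = rng.normal(size=8)

sq = lambda v: float(np.dot(v, v))  # squared Hilbert norm

lhs = (sq(f) + sq(g)) / 2
rhs = sq((f + g) / 2) + sq((f - g) / 2)
print(abs(lhs - rhs))  # parallelogram identity: difference is numerically zero
```

The identity is exactly what produces the extra $\lambda \|(f - f_{\phi,\sigma})/2\|^2$ term in the averaged inequality, and hence the variance-to-expectation bound.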
Combining (43) and (44) finishes the proof of Lemma 18.

Proof of Lemma 19 Following a path similar to (42), we can write, for any $f \in \mathcal{F}_\sigma$ and any $(x,y) \in X \times \{-1,+1\}$,
$$\begin{aligned} g_f(x,y) &= \phi(y f(x)) - \phi\big(y f_{\phi,\sigma}(x)\big) + \lambda \big( \|f\|^2_{H_\sigma} - \|f_{\phi,\sigma}\|^2_{H_\sigma} \big) \\ &\le L \big| f(x) - f_{\phi,\sigma}(x) \big| + \lambda\, \frac{\phi(0)}{\lambda} \\ &\le 2 L \kappa_\sigma \sqrt{\frac{\phi(0)}{\lambda}} + \phi(0) = B. \end{aligned}$$

Proof of Lemma 20 Let us introduce the notation $l_\phi(f)(x,y) := \phi(y f(x))$ and $L_{\phi,\sigma}(f) := l_\phi(f) + \lambda \|f\|^2_{H^X_\sigma}$, for $f \in H^X_\sigma$ and $(x,y) \in X \times \{-1,+1\}$. We can then rewrite (29) as
$$\mathcal{G}_\sigma = \big\{ L_{\phi,\sigma}(f) - L_{\phi,\sigma}\big(f^X_{\phi,\sigma}\big) \ : \ f \in \mathcal{F}_\sigma \big\}.$$
The covering number of a set does not change when the set is translated by a single function, therefore
$$\mathcal{N}\big(B^{-1}\mathcal{G}_\sigma, \varepsilon, L_2(T)\big) = \mathcal{N}\big(B^{-1} L_{\phi,\sigma}(\mathcal{F}_\sigma), \varepsilon, L_2(T)\big).$$
Denoting now by $[a,b]$ the set of constant functions with values between $a$ and $b$, we deduce, from the fact that $\lambda \|f\|^2_{H^X_\sigma} \le \phi(0)$ for $f \in \mathcal{F}_\sigma$, that
$$B^{-1} L_{\phi,\sigma}(\mathcal{F}_\sigma) \subset B^{-1} l_\phi(\mathcal{F}_\sigma) + \big[ 0, B^{-1} \phi(0) \big].$$
Using the subadditivity of the entropy we therefore get
$$\log \mathcal{N}\big(B^{-1}\mathcal{G}_\sigma, 2\varepsilon, L_2(T)\big) \le \log \mathcal{N}\big(B^{-1} l_\phi(\mathcal{F}_\sigma), \varepsilon, L_2(T)\big) + \log \mathcal{N}\big( \big[ 0, B^{-1} \phi(0) \big], \varepsilon, L_2(T) \big). \quad (45)$$
In order to upper bound the first term in the right-hand side of (45), we observe, using (17), that for any $f \in \mathcal{F}_\sigma$ and $x \in X$,
$$|f(x)| \le \kappa_\sigma \|f\|_{H^X_\sigma} \le \kappa_\sigma \sqrt{\frac{\phi(0)}{\lambda}},$$
and therefore a simple computation shows that, if $u(x,y) := B^{-1} \phi(y f(x))$ and $u'(x,y) := B^{-1} \phi(y f'(x))$ are two elements of $B^{-1} l_\phi(\mathcal{F}_\sigma)$ with $f, f' \in \mathcal{F}_\sigma$, then for any sample $T$,
$$\|u - u'\|_{L_2(T)} \le B^{-1} L\, \|f - f'\|_{L_2(T)},$$
where $L$ denotes, as above, the Lipschitz constant of $\phi$ on $\big[-\kappa_\sigma \sqrt{\phi(0)/\lambda},\, \kappa_\sigma \sqrt{\phi(0)/\lambda}\big]$, and therefore
$$\log \mathcal{N}\big(B^{-1} l_\phi(\mathcal{F}_\sigma), \varepsilon, L_2(T)\big) \le \log \mathcal{N}\big( \mathcal{F}_\sigma, B \varepsilon L^{-1}, L_2(T) \big) \le \log \mathcal{N}\Big( B^X_\sigma,\, B \varepsilon L^{-1} \sqrt{\frac{\lambda}{\phi(0)}},\, L_2(T) \Big).$$
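The contraction step above rests on a pointwise Lipschitz bound: composing with an $L$-Lipschitz loss cannot increase empirical $L_2$ distances by more than the factor $L$. A small numerical sketch with the 1-Lipschitz hinge loss (the linear function class and sample are hypothetical illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100)
phi = lambda u: np.maximum(1 - u, 0)  # hinge loss, 1-Lipschitz

def l2T(a, b):
    """Empirical L2 distance between two vectors of function values."""
    return np.sqrt(np.mean((a - b) ** 2))

ok = True
for _ in range(50):
    w1, w2 = rng.normal(size=2)
    f1, f2 = w1 * X, w2 * X  # two functions evaluated on the sample
    # contraction: composing with a 1-Lipschitz phi cannot increase the distance
    ok &= l2T(phi(f1), phi(f2)) <= l2T(f1, f2) + 1e-12
print(ok)
```

Since the pointwise bound $|\phi(a) - \phi(b)| \le |a - b|$ holds for every sample point, an $\varepsilon/L$-cover of $\mathcal{F}_\sigma$ maps to an $\varepsilon$-cover of $\phi \circ \mathcal{F}_\sigma$, which is exactly how the covering-number comparison in the proof is obtained.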