Bayes Point Machines

Journal of Machine Learning Research 1 (2001) 245-279. Submitted 12/00; Published 8/01.

Bayes Point Machines

Ralf Herbrich, Microsoft Research, St George House, 1 Guildhall Street, CB2 3NH Cambridge, United Kingdom
Thore Graepel, Technical University of Berlin, Franklinstr. 28/29, 10587 Berlin, Germany
Colin Campbell, Department of Engineering Mathematics, Bristol University, BS8 1TR Bristol, United Kingdom

Editor: Christopher K. I. Williams

Abstract

Kernel classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft boundaries in order to admit training errors. Experimentally, we find that for the zero training error case Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

1 Introduction

Kernel machines have recently gained a lot of attention due to the popularisation of the support vector machine (Vapnik, 1995) with a focus on classification and the revival of Gaussian processes for regression (Williams, 1999). Subsequently, support vector
machines have been modified to handle regression (Smola, 1998) and Gaussian processes have been adapted to the problem of classification (Williams and Barber, 1998; Opper and Winther, 2000). Both schemes essentially work in the same function space that is characterised by kernels and covariance functions, respectively. Whilst the formal similarity of the two methods is striking, the underlying paradigms of inference are very different. The support vector machine was inspired by results from statistical/PAC learning theory while Gaussian processes are usually considered in a Bayesian framework. This ideological clash can be viewed as a continuation in machine learning of the by now classical disagreement between Bayesian and frequentist statistics (Aitchison, 1964). With regard to algorithmics the two schools of thought appear to favour two different methods of learning and predicting: the support vector community, as a consequence of the formulation of the support vector machine as a quadratic programming problem, focuses on learning as optimisation, while the Bayesian community favours sampling schemes based on the Bayesian posterior. Of course there exists a strong relationship between the two ideas, in particular with the Bayesian maximum a posteriori (MAP) estimator being the solution of an optimisation problem. In practice, optimisation-based algorithms have the advantage of a unique, deterministic solution and the availability of the cost function as an indicator of the quality of the solution. In contrast, Bayesian algorithms based on sampling and voting are more flexible and enjoy the so-called anytime property, providing a

relatively good solution at any point in time. Often, however, they suffer from the computational costs of sampling the Bayesian posterior.

In this paper we present the Bayes point machine as an approximation to Bayesian inference for linear classifiers in kernel space. In contrast to the Gaussian process viewpoint we do not define a Gaussian prior on the length ||w|| of the weight vector. Instead, we only consider weight vectors of length ||w|| = 1, because it is only the spatial direction of the weight vector that matters for classification. It is then natural to define a uniform prior on the resulting ball-shaped hypothesis space. Hence, we determine the centre of mass of the resulting posterior that is uniform in version space, i.e. in the zero training error region. It should be kept in mind that the centre of mass is merely an approximation to the real Bayes point from which the name of the algorithm was derived. In order to estimate the centre of mass we suggest both a dynamic system called a kernel billiard and an approximative method that uses the perceptron algorithm trained on permutations of the training sample. The latter method proves to be efficient enough to make the Bayes point machine applicable to large data sets. An additional insight into the usefulness of the centre of mass comes from the statistical mechanics approach to neural computing, where the generalisation error for Bayesian learning algorithms has been calculated for the case of randomly constructed and unbiased patterns x (Opper and Haussler, 1991). Thus if ζ is the number of training examples per weight and ζ is large, the generalisation error of the centre of mass scales as 0.44/ζ, whereas the scaling with ζ is poorer for the solutions found by the linear support vector machine (scales as 0.5/ζ; see Opper and Kinzel, 1995), Adaline (scales as 0.24/√ζ; see Opper et al., 1990) and other approaches. Of course many of the viewpoints and algorithms presented in this paper are based on extensive previous work carried out by numerous authors in
the past. In particular it seems worthwhile to mention that linear classifiers have been studied intensively in two rather distinct communities: the machine learning community and the statistical physics community. While it is beyond the scope of this paper to review the entire history of the field, we would like to emphasise that our geometrical viewpoint as expressed later in the paper has been inspired by the very original paper "Playing billiard in version space" by P. Ruján (Ruján, 1997). Also, in that paper the term Bayes point was coined and the idea of using a billiard-like dynamical system for uniform sampling was introduced. Both we (Herbrich et al., 1999a,b, 2000a) and Ruján and Marchand (2000) independently generalised the algorithm to be applicable in kernel space. Finally, following a theoretical suggestion of Watkin (1993), we were able to scale up the Bayes point algorithm to large data sets by using different perceptron solutions from permutations of the training sample.

The paper is structured as follows: In the following section we review the basic ideas of Bayesian inference with a particular focus on classification learning. Along with a discussion about the optimality of the Bayes classification strategy we show that for the special case of linear classifiers in feature space the centre of mass of all consistent classifiers is arbitrarily close to the Bayes point (with increasing training sample size) and can be efficiently estimated in the linear span of the training data. Moreover, we give a geometrical picture of support vector learning in feature space which reveals that the support vector machine can be viewed as an approximation to the Bayes point machine. In Section 3 we present two algorithms for the estimation of the centre of mass of version space: one exact method and an approximate method tailored for large training samples. An extensive list of experimental results is presented in Section 4, both on small machine learning benchmark datasets as well as on large scale datasets from the field of
handwritten digit recognition. In Section 5 we summarise the results and discuss some theoretical extensions of the method presented. In order to unburden the main text, the lengthy proofs as well as the pseudocode have been relegated to the appendix.

We denote n-tuples by italic bold letters (e.g. x = (x_1, ..., x_n)), vectors by roman bold letters (e.g. x), random variables by sans serif font (e.g. X) and vector spaces by calligraphic capitalised letters (e.g. X). The symbols P, E and I denote a probability measure, the expectation of a random variable and the indicator function, respectively.
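Since the large-scale method mentioned above is easy to state independently of the formal development, the following minimal Python sketch may help fix ideas: train a dual (kernel) perceptron on several random permutations of the training sample and average the normalised solutions. This is our own illustrative code, not the paper's pseudocode; the function names are invented, and a kernel is assumed to be given via an explicit Gram matrix.

```python
import numpy as np

def kernel_perceptron(G, y, order, max_epochs=100):
    """Dual perceptron: the weight vector is w = sum_i alpha_i * y_i * x_i,
    so only the Gram matrix G[i, j] = k(x_i, x_j) is needed.  Training
    points are visited in the given permutation `order`."""
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if y[i] * (G[i] @ (alpha * y)) <= 0:   # mistake at x_i
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:                          # consistent classifier found
            break
    return alpha

def approx_bayes_point(G, y, n_perms=20, seed=0):
    """Average the unit-norm perceptron solutions obtained from random
    permutations of the training sample."""
    rng = np.random.default_rng(seed)
    total = np.zeros(len(y))
    for _ in range(n_perms):
        alpha = kernel_perceptron(G, y, rng.permutation(len(y)))
        w_norm = np.sqrt((alpha * y) @ G @ (alpha * y))  # ||w|| in feature space
        total += alpha / w_norm
    return total / n_perms
```

Since version space is convex, an average of consistent unit-length classifiers again classifies the training sample correctly; the hope, made precise below, is that it also lies close to the centre of mass of version space.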

2 A Bayesian Consideration of Learning

In this section we would like to revisit the Bayesian approach to learning (see Buntine, 1992; MacKay, 1991; Neal, 1996; Bishop, 1995, for a more detailed treatment). Suppose we are given a training sample z = (x, y) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m of size m drawn iid from an unknown distribution P_Z = P_XY. Furthermore, assume we are given a fixed set H ⊆ Y^X of functions h : X → Y referred to as the hypothesis space. The task of learning is then to find the function h* which performs best on new yet unseen patterns z = (x, y) drawn according to P_XY.

Definition 1 (Learning Algorithm) A (deterministic) learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X is a mapping from training samples z of arbitrary size m ∈ N to functions from X to Y. The image of A, i.e. {A(z) | z ∈ Z^m} ⊆ Y^X, is called the effective hypothesis space H_{A,m} of the learning algorithm A for the training sample size m ∈ N. If there exists a hypothesis space H ⊆ Y^X such that for every training sample size m ∈ N we have H_{A,m} ⊆ H we shall omit the indices on H.

In order to assess the quality of a function h ∈ H we assume the existence of a loss function l : Y × Y → R^+. The loss l(y, y') ∈ R^+ is understood to measure the incurred cost when predicting y while the true output was y'. Hence we always assume that for all y ∈ Y, l(y, y) = 0. A typical loss function for classification is the so-called zero-one loss l_0-1 defined as follows.

Definition 2 (Zero-One Loss) Given a fixed output space Y, the zero-one loss is defined by l_0-1(y, y') := I_{y ≠ y'}.

Based on the concept of a loss l, let us introduce several quality measures for hypotheses h ∈ H.

Definition 3 (Generalisation and Training Error) Given a probability measure P_XY and a loss l : Y × Y → R^+, the generalisation error R[h] of a function h : X → Y is defined by

    R[h] := E_XY[l(h(X), Y)].

Given a training sample z = (x, y) ∈ (X × Y)^m of size m and a loss l : Y × Y → R^+, the training error R_emp[h, z] of a function h : X → Y is given by

    R_emp[h, z] := (1/m) Σ_{i=1}^m l(h(x_i), y_i).

Clearly, only the generalisation error R[h] is appropriate to capture
the performance of a fixed classifier h ∈ H on new patterns z = (x, y). Nonetheless, we shall see that the training error plays a crucial role as it provides an estimate of the generalisation error based on the training sample.

Definition 4 (Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X. Then for any fixed training sample size m ∈ N the generalisation error R_m[A] of A is defined by R_m[A] := E_{Z^m}[R[A(Z)]], that is, the expected generalisation error of the hypotheses found by the algorithm.

Note that for any loss function l : Y × Y → R^+ a small generalisation error R_m[A] of the algorithm A guarantees a small generalisation error for most randomly drawn training samples z, because by Markov's inequality we have, for every ε > 0,

    P_{Z^m}(R[A(Z)] > ε · E_{Z^m}[R[A(Z)]]) ≤ 1/ε.

Hence we can view R_m[A] also as a performance measure of A's hypotheses for randomly drawn training samples z. Finally, let us consider a probability measure P_H over the space of all possible mappings from X to Y. Then the average generalisation error of a learning algorithm A is defined as follows.

Definition 5 (Average Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X. Then for each fixed training sample size m ∈ N the average generalisation error R̄_m[A] of A is defined by

    R̄_m[A] := E_H[E_{Z^m|H=h}[E_X[E_{Y|X=x,H=h}[l((A(Z))(x), Y)]]]],   (1)

that is, the average performance of the algorithm A's solution over the random draw of training samples and target hypotheses.

The average generalisation error is the standard measure of performance of an algorithm A if we have little knowledge, expressed via P_H, about the potential function h* that labels all our data. Then the measure (1) averages out our ignorance about the unknown h*, thus considering the performance of A on average. There is a noticeable relation between R̄_m[A] and R_m[A] if we assume that, given a measure P_H, the conditional distribution of outputs y given x is governed by

    P_{Y|X=x}(y) = P_H(H(x) = y).   (2)

Under this condition we have that R̄_m[A] = R_m[A]. This result, however, is not too surprising taking into account that under the assumption (2) the measure P_H fully encodes the unknown relationship between inputs x and outputs y.

2.1 The Bayesian Solution

In the Bayesian framework we are not simply interested in h* := argmin_{h∈H} R[h] itself but in our knowledge or belief in h*. To this end, Bayesians use the concept of prior and posterior belief, i.e. the knowledge of h* before having seen any data and after having seen the data, which in the current case is our training sample z. It is well known that under consistency rules known as Cox's axioms (Cox, 1946) beliefs can be mapped onto probability measures P_H. Under these rather plausible conditions the only consistent way to transfer prior belief P_H into posterior belief P_{H|Z^m=z} is given by Bayes' theorem:

    P_{H|Z^m=z}(h) = P_{Z^m|H=h}(z) P_H(h) / E_H[P_{Z^m|H=h}(z)] = P_{Y^m|X^m=x,H=h}(y) P_H(h) / E_H[P_{Y^m|X^m=x,H=h}(y)].   (3)

The second expression is obtained by noticing that P_{Z^m|H=h}(z) = P_{Y^m|X^m=x,H=h}(y) P_{X^m|H=h}(x) = P_{Y^m|X^m=x,H=h}(y) P
_{X^m}(x), because hypotheses do not have an influence on the generation of patterns. Based on a given loss function l we can further decompose the first term of the numerator of (3), known as the likelihood of h. Let us assume that the probability of a class y given an instance x and an hypothesis h is inversely proportional to the exponential of the loss incurred by h on x.(1) Thus we obtain

    P_{Y|X=x,H=h}(y) = exp(-β l(h(x), y)) / Σ_{y'∈Y} exp(-β l(h(x), y'))
                     = exp(-β l(h(x), y)) / C(x)
                     = 1/(1 + exp(-β))          if l(h(x), y) := l_0-1(h(x), y) = 0,
                     = exp(-β)/(1 + exp(-β))    if l(h(x), y) := l_0-1(h(x), y) = 1,   (4)

where C(x) is a normalisation constant, which in the case of the zero-one loss l_0-1 is independent(2) of x, and β controls the assumed level of noise. Note that the loss used in the exponentiated-loss likelihood function

(1) In fact, it already suffices to assume that E_{Y|X=x}[l(y, Y)] = E_H[l(y, H(x))], i.e. the prior correctly models the conditional distribution of the classes as far as the fixed loss is concerned.
(2) Note that for loss functions with real-valued arguments this need not be the case, which makes a normalisation independent of x quite intricate (see Sollich, 2000, for a detailed treatment).

is not to be confused with the decision-theoretic loss used in the Bayesian framework, which is introduced only after a posterior has been obtained in order to reach a risk-optimal decision.

Definition 6 (PAC Likelihood) Suppose we are given an arbitrary loss function l : Y × Y → R^+. Then we call the function

    P_{Y|X=x,H=h}(y) := I_{y=h(x)}   (5)

of h the PAC likelihood for h.

Note that (5) is the limiting case of (4) for β → ∞. Assuming the PAC likelihood it immediately follows that for any prior belief P_H the posterior belief P_{H|Z^m=z} simplifies to

    P_{H|Z^m=z}(h) = P_H(h)/P_H(V(z)) if h ∈ V(z), and 0 if h ∉ V(z),   (6)

where the version space V(z) is defined as follows (see Mitchell, 1977, 1982).

Definition 7 (Version Space) Given an hypothesis space H ⊆ Y^X and a training sample z = (x, y) ∈ (X × Y)^m of size m ∈ N, the version space V(z) ⊆ H is defined by

    V(z) := {h ∈ H | ∀i ∈ {1, ..., m} : h(x_i) = y_i}.

Since all information contained in the training sample z is used to update the prior P_H by equation (3), all that will be used to classify a novel test point x is the posterior belief P_{H|Z^m=z}.

2.2 The Bayes Classification Strategy

In order to classify a new test point x, for each class y the Bayes classification strategy(3) determines the loss incurred by each hypothesis h ∈ H applied to x and weights it according to its posterior probability P_{H|Z^m=z}(h). The final decision is made for the class y ∈ Y that achieves the minimum expected loss, i.e.

    Bayes_z(x) := argmin_{y∈Y} E_{H|Z^m=z}[l(H(x), y)].   (7)

This strategy has the following appealing property.

Theorem 8 (Optimality of the Bayes Classification Strategy) Suppose we are given a fixed hypothesis space H ⊆ Y^X. Then, for any training sample size m ∈ N, for any symmetric loss l : Y × Y → R^+, and for any two measures P_H and P_X, among all learning algorithms the Bayes classification strategy Bayes_z given by (7) minimises the average generalisation error R̄_m[Bayes_z] under the assumption that for each h with P_H(h) > 0

    ∀y ∈ Y : ∀x ∈ X : E_{Y|X=x,H=h}[l(y, Y)] = l(y, h(x)).   (8)

Proof Let us consider
a fixed learning algorithm A. Then it holds true that

    R̄_m[A] = E_H[E_{Z^m|H=h}[E_X[E_{Y|X=x,H=h}[l((A(Z))(x), Y)]]]]
            = E_X[E_H[E_{Z^m|H=h}[E_{Y|X=x,H=h}[l((A(Z))(x), Y)]]]]
            = E_X[E_{Z^m}[E_{H|Z^m=z}[E_{Y|X=x,H=h}[l((A(Z))(x), Y)]]]]
            = E_X[E_{Z^m}[E_{H|Z^m=z}[l((A(Z))(X), H(X))]]],   (9)

where we exchanged the order of expectations over X in the second line, applied the theorem of repeated integrals (see, e.g., Feller, 1966) in the third line and finally used (8) in the last line. Using the symmetry of the loss function, the innermost expression of (9) is minimised by the Bayes classification strategy (7)

(3) The reason we do not call this mapping from X to Y a classifier is that the resulting mapping is (in general) not within the hypothesis space considered beforehand.

for any possible training sample z and any possible test point x. Hence, (7) minimises the whole expression, which proves the theorem.

In order to enhance the understanding of this result let us consider the simple case of l = l_0-1 and Y = {-1, +1}. Then, given a particular classifier h ∈ H having non-zero prior probability P_H(h) > 0, by assumption (8) we require that the conditional distribution of classes y given x is delta-peaked at h(x), because

    E_{Y|X=x,H=h}[l_0-1(y, Y)] = l_0-1(y, h(x))  ⇔  P_{Y|X=x,H=h}(-y) = I_{y≠h(x)}  ⇔  P_{Y|X=x,H=h}(y) = I_{h(x)=y}.

Although for a fixed h ∈ H drawn according to P_H we do not know that Bayes_z achieves the smallest generalisation error R[Bayes_z], we can guarantee that on average over the random draw of h's the Bayes classification strategy is superior. In fact, the optimal classifier for a fixed h ∈ H is simply h itself(4) and in general Bayes_z(x) ≠ h(x) for at least a few x ∈ X.

2.3 The Bayes Point Algorithm

Although the Bayes classification strategy is on average the optimal strategy to perform when given a limited amount of training data z, it is computationally very demanding as it requires the evaluation of E_{H|Z^m=z}[l(H(x), y)] for each possible y at each new test point x (Graepel et al., 2000). The problem arises because the Bayes classification strategy does not correspond to any one single classifier h ∈ H. One way to tackle this problem is to require the classifier A(z) learned from any training sample z to lie within a fixed hypothesis space H ⊆ Y^X containing functions h ∈ H whose evaluation at a particular test point x can be carried out efficiently. Thus, if it is additionally required to limit the possible solution of a learning algorithm to a given hypothesis space H ⊆ Y^X, we can in general only hope to approximate Bayes_z.

Definition 9 (Bayes Point Algorithm) Suppose we are given a fixed hypothesis space H ⊆ Y^X and a fixed loss l : Y × Y → R^+. Then, for any two measures P_X and P_H, the Bayes point algorithm A_bp is given by A_bp(z) := argmin_{h∈H} E_X[E_{H|Z^m=z}[l
(h(X), H(X))]], that is, for each training sample z ∈ Z^m the Bayes point algorithm chooses the classifier h_bp := A_bp(z) ∈ H that mimics best the Bayes classification strategy (7) on average over randomly drawn test points. The classifier A_bp(z) is called the Bayes point.

Assuming the correctness of the model given by (8) we furthermore remark that the Bayes point algorithm A_bp is the best approximation to the Bayes classification strategy (7) in terms of the average generalisation error, i.e. measuring the distance of the learning algorithm A for H using the distance ||A - Bayes|| = R̄_m[A] - R̄_m[Bayes]. In this sense, for a fixed training sample z we can view the Bayes point h_bp as a projection of Bayes_z into the hypothesis space H ⊆ Y^X. The difficulty with the Bayes point algorithm, however, is the need to know the input distribution P_X for the determination of the hypothesis learned from z. This somewhat limits the applicability of the algorithm as opposed to the Bayes classification strategy, which requires only broad prior knowledge about the underlying relationship expressed via some prior belief P_H.

(4) It is worthwhile mentioning that the only information to be used in any classification strategy is the training sample z and the prior P_H. Hence it is impossible to detect which classifier h ∈ H labels a fixed tuple x only on the basis of the labels y observed on the training sample. Thus, although we might be lucky in guessing h* for a fixed h* ∈ H and z ∈ Z^m, we cannot do better than the Bayes classification strategy Bayes_z when considering the average performance, the average being taken over the random choice of the classifiers and the training samples z.
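For the zero-one loss and a finite set of hypotheses carrying posterior mass, the Bayes classification strategy (7) is just a posterior-weighted majority vote. A small self-contained Python sketch (our own illustration; in the paper the hypothesis space is a continuum, so the finite posterior here is a toy assumption):

```python
def bayes_strategy(x, hypotheses, posterior, classes=(-1, +1)):
    """Bayes_z(x): pick the class minimising the posterior-expected
    zero-one loss, i.e. a posterior-weighted majority vote."""
    def expected_loss(y):
        return sum(p * (h(x) != y) for h, p in zip(hypotheses, posterior))
    return min(classes, key=expected_loss)

# three hypotheses with posterior masses 0.4, 0.3 and 0.3
hypotheses = [lambda x: +1, lambda x: +1, lambda x: -1]
posterior = [0.4, 0.3, 0.3]
print(bayes_strategy(0.0, hypotheses, posterior))   # -> 1
```

Note that the vote need not agree everywhere with any single hypothesis, which is exactly why the Bayes point of Definition 9 is introduced as a single-classifier approximation to it.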

2.3.1 The Bayes Point for Linear Classifiers

We now turn our attention to the special case of linear classifiers, where we assume that N measurements of the objects x are taken by features φ_i : X → R, thus forming a (vectorial) feature map φ : X → K ⊆ ℓ_2^N, φ(x) = (φ_1(x), ..., φ_N(x)). Note that by this formulation the special case of vectorial objects x is automatically taken care of by the identity map φ(x) = x. For notational convenience we use the shorthand notation(5) x for φ(x), such that ⟨x, w⟩ := Σ_{i=1}^N φ_i(x) w_i. Hence, for a fixed mapping φ the hypothesis space is given by

    H := {x ↦ sign(⟨x, w⟩) | w ∈ W},   W := {w ∈ K | ||w|| = 1}.   (10)

As each hypothesis h_w is uniquely defined by its weight vector w we shall in the following consider prior beliefs P_W over W, i.e. over possible weight vectors (of unit length), in place of priors P_H. By construction, the output space is Y = {-1, +1} and we furthermore consider the special case of l = l_0-1 as defined by Definition 2. If we assume that the input distribution is spherically Gaussian in the feature space K of dimensionality d = dim(K), i.e.

    f_X(x) = π^(-d/2) exp(-||x||²),   (11)

then we find that the centre of mass

    w_cm := E_{W|Z^m=z}[W] / ||E_{W|Z^m=z}[W]||   (12)

is a very good approximation to the Bayes point w_bp and converges towards w_bp if the posterior belief P_{W|Z^m=z} becomes sharply peaked (for a similar result see Watkin, 1993).

Theorem 10 (Optimality of the Centre of Mass) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ_2^N. Then, for all m ∈ N, if P_X possesses the density (11) and the prior belief is correct, i.e. (8) is valid, the average generalisation error of the centre of mass as given by (12) always fulfils

    R̄_m[A_cm] - R̄_m[A_bp] ≤ E_{Z^m}[κ(ε(Z))],

where κ(ε) := arccos(ε)·√(1 - ε²)/π and

    ε(z) := min_{w : P_{W|Z^m=z}(w) > 0} ⟨w_cm, w⟩.

The lengthy proof of this theorem is given in Appendix A.1. The interesting fact to note about this result is that lim_{ε→1} κ(ε) = 0 and thus, whenever the prior belief P_W is not vanishing for some w, lim_{m→∞} E_{Z^m}[κ(ε(Z))] = 0, because for increasing training sample size the posterior
is sharply peaked at the weight vector labelling the data.(6) This shows that for increasing training sample size the centre of mass (under the posterior P_{W|Z^m=z}) is a good approximation to the optimal projection of the Bayes classification strategy, the Bayes point. Henceforth, any algorithm which aims at returning the centre of mass under the posterior P_{W|Z^m=z} is called a Bayes point machine. Note that in the case of the PAC likelihood as defined in Definition 6 the centre of mass under the posterior P_{W|Z^m=z} coincides with the centre of mass of version space (see Definition 7).

(5) This should not be confused with x, which denotes the sample (x_1, ..., x_m) of training objects.
(6) This result is a slight generalisation of the result in Watkin (1993), which only proved this to be true for the uniform prior P_W.
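In low dimensions the centre of mass of version space can be estimated by brute-force rejection sampling, which makes the quantity of Theorem 10 concrete: draw directions uniformly from the unit sphere, keep those consistent with the training sample, and normalise their mean. This is our own illustrative sketch (it scales hopelessly with dimension, which is precisely why the following sections develop billiard and perceptron samplers instead):

```python
import numpy as np

def centre_of_mass(X, y, n_samples=50_000, seed=0):
    """Monte Carlo estimate of w_cm = E[W|Z=z] / ||E[W|Z=z]|| for a posterior
    uniform over version space {w : ||w|| = 1, y_i <x_i, w> > 0 for all i}."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)     # uniform on the sphere
    consistent = np.all((W @ X.T) * y > 0, axis=1)    # version-space members
    w = W[consistent].mean(axis=0)
    return w / np.linalg.norm(w)
```

Because version space is the intersection of half-spaces with the unit sphere, the mean of the accepted samples again lies on the correct side of every training hyperplane.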

Figure 1: Shown is the margin a = γ_x(w) = ⟨x, w⟩ under the assumption that ||w|| = ||x|| = 1. At the same time, a (the length of the dotted line) equals the distance of x from the hyperplane {x' | ⟨x', w⟩ = 0} (dashed line) as well as the distance of the weight vector w from the hyperplane {w' | ⟨x, w'⟩ = 0} (dashed line). Note, however, that the Euclidean distance of w from the separating boundary {w' ∈ W | ⟨x, w'⟩ = 0} equals b(a), where b is a strictly monotonic function of its argument.

2.4 A (Pseudo) Bayesian Derivation of the Support Vector Machine

In this section we would like to show that the well known support vector machine (Boser et al., 1992; Cortes, 1995; Vapnik, 1995) can also be viewed as an approximation to the centre of mass of version space V(z) in the noise-free scenario, i.e. considering the PAC likelihood given in Definition 6 and additionally assuming that ∀x_i ∈ x : ||x_i|| = ||φ(x_i)|| = const. In order to see this let us recall that the support vector machine aims at maximising the margin γ_z(w) of the weight vector w on the training sample z, given by

    γ_z(w) := min_{i∈{1,...,m}} y_i ⟨x_i, w⟩/||w|| = min_{i∈{1,...,m}} y_i ⟨x_i, w/||w||⟩,   (13)

where each term y_i ⟨x_i, w⟩/||w|| is denoted γ_{x_i}(w); for all w of unit length this is merely the minimal real-valued output (flipped to the correct sign) over the whole training sample. In order to solve this problem algorithmically one takes advantage of the fact that fixing the real-valued output to one (rather than the norm ||w|| of the weight vector w) renders the problem of finding the margin maximiser w_SVM as a problem with a quadratic objective function (||w||² = ⟨w, w⟩) under linear constraints (y_i ⟨x_i, w⟩ ≥ 1), i.e.

    w_SVM := argmax_{w∈W} min_{i∈{1,...,m}} y_i ⟨x_i, w⟩   (14)
           ∝ argmin_{w : min_i y_i ⟨x_i, w⟩ = 1} ||w||².   (15)

Note that the weight vectors in (15) are called the weight vectors of the canonical hyperplanes (see Vapnik, 1998) and that this set is highly dependent on the given training sample. Nonetheless, the solution to (15) is (up to scaling) equivalent to the solution of (14), a formulation much more amenable for
theoretical studies. Interestingly, however, the quantity γ_{x_i}(w) as implicitly defined in (13) is not only the distance of the point y_i x_i from the hyperplane having the normal w but also ||x_i|| times the Euclidean distance of the point w from the hyperplane having the normal y_i x_i (see Figure 1). Thus γ_z(w) can be viewed as the radius of

the ball {v ∈ W | ||w - v|| ≤ b(γ_z(w))} that only contains weight vectors in version space V(z). Here b : R^+ → R^+ is a strictly monotonic function of its argument and its effect is graphically depicted in Figure 1. As a consequence thereof, maximising the margin γ_z(w) over the choice of w returns the classifier w_SVM that is the centre of the largest ball still inscribable in version space. Note that the whole reasoning relied on the assumption that all training points x_i have a constant norm in feature space K. If this assumption is violated, each distance of a classifier w to the hyperplane having the normal y_i x_i is measured on a different scale, and thus the points with the largest norm ||x_i|| in feature space K have the highest influence on the resulting solution. To circumvent this problem it has been suggested elsewhere that input vectors should be normalised in feature space before applying any kernel method, in particular the support vector machine algorithm (see Herbrich and Graepel, 2000; Schölkopf et al., 1999; Joachims, 1998; Haussler, 1999). Furthermore, all indices i ∈ I_SV ⊆ {1, ..., m} at which the minimum y_i ⟨x_i, w_SVM⟩ in (14) is attained are the ones for which y_i ⟨x_i, w⟩ = 1 in the formulation (15). As the latter are called support vectors we see that the support vectors are the training points at which the largest inscribable ball touches the corresponding hyperplane {w ∈ W | y_i ⟨x_i, w⟩ = 0}.

2.5 Applying the Kernel Trick

When solving (15) over the possible choices of w ∈ W it is well known that the solution w_SVM admits the representation w_SVM = Σ_{i=1}^m α_i x_i, that is, the solution to (15) must live in the linear span of the training points. This follows naturally from the following theorem (see also Schölkopf et al., 2001).

Theorem 11 (Representer Theorem) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ_2^N, a training sample z = (x, y) ∈ Z^m, a cost function c : X^m × Y^m × R^m → R ∪ {∞} strictly monotonically decreasing in the third argument, and the class of linear functions in K as given by (10). Then any w_z ∈ W defined by w_z
:= argmin_{w∈W} c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩))   (16)

admits a representation of the form

    ∃α ∈ R^m : w_z = Σ_{i=1}^m α_i x_i.   (17)

The proof is given in Appendix A.2. In order to see that this theorem applies to support vector machines note that (14) is equivalent to the minimiser of (16) when using c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩)) = -min_{i} y_i ⟨x_i, w⟩, which is strictly monotonically decreasing in its third argument. A slightly more difficult argument is necessary to see that the centre of mass (12) can also be written as a minimiser of (16) using a specific cost function c. At first we recall that the centre of mass has the property of minimising E_{W|Z^m=z}[||w - W||²] over the choice of w ∈ W (see also (12)).

Theorem 12 (Sufficiency of the Linear Span) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ_2^N. Let us assume that P_W is uniform and P_{Y|X=x,W=w}(y) = f(sign(y ⟨x, w⟩)), i.e. the likelihood depends only on the sign of the real-valued output y ⟨x, w⟩ of w. Let L_x := {Σ_{i=1}^m α_i x_i | α ∈ R^m} be the linear span of the mapped data points {x_1, ..., x_m} and W_x := W ∩ L_x. Then for any training sample z ∈ Z^m and any w ∈ W

    ∫_W ||w - v||² dP_{W|Z^m=z}(v) = C ∫_{W_x} ||w - v||² dP_{W|Z^m=z}(v),   (18)

that is, up to a constant C ∈ R^+ that is independent of w, it suffices to consider vectors of unit length in the linear span of the mapped training points {x_1, ..., x_m}.

The proof is given in Appendix A.3. An immediate consequence of this theorem is the fact that we only need to consider the m-dimensional sphere W_x in order to find the centre of mass under the assumption of a uniform prior P_W. Hence a cost function c such that (16) finds the centre of mass is given by

    c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩)) = -2 ∫ Σ_{i=1}^m α_i ⟨x_i, w⟩ dP_{A|Z^m=(x,y)}(α),

where P_{A|Z^m=z} is only non-zero for vectors α such that ||Σ_{i=1}^m α_i x_i|| = 1, and is independent of w. The tremendous advantage of a representation of the solution w_z by (17) becomes apparent when considering the real-valued output of the classifier at any given data point (either training or test point):

    ⟨w_z, x⟩ = ⟨Σ_{i=1}^m α_i x_i, x⟩ = Σ_{i=1}^m α_i ⟨x_i, x⟩ = Σ_{i=1}^m α_i k(x_i, x).

Clearly, all that is needed in the feature space K is the inner product function

    k(x, x̃) := ⟨φ(x), φ(x̃)⟩.   (19)

Reversing the chain of arguments indicates how the kernel trick may be used to find an efficient implementation: we fix a symmetric function k : X × X → R, called a kernel, and show that there exists a feature mapping φ_k : X → K ⊆ ℓ_2^N such that (19) is valid for all x, x̃ ∈ X. A sufficient condition for k being a valid inner product function is given by Mercer's theorem (see Mercer, 1909). In a nutshell, whenever the evaluation of k at any given sample (x_1, ..., x_m) results in a positive semidefinite matrix G, G_ij := k(x_i, x_j), then k is a so-called Mercer kernel. The matrix G is called the Gram matrix and is the only quantity needed in support vector and Bayes point machine learning. For further details on the kernel trick the reader is referred to Schölkopf et al. (1999); Cristianini and Shawe-Taylor (2000); Wahba (1990); Vapnik (1998).

3 Estimating the Bayes Point in Feature Space

In order to estimate the Bayes point in feature space K we consider a Monte Carlo method, i.e. instead of exactly computing the expectation (12) we approximate it by an
average over weight vectors w drawn according to P_{W|Z^m=z} and restricted to W_x (see Theorem 12). In the following we restrict ourselves to the PAC likelihood given in (5) and to P_W being uniform on the unit sphere W ⊂ K. By this assumption we know that the posterior is uniform over version space (see (6)). In Figure 2 we plot an example for the special case of an N = 3 dimensional feature space K. It is, however, already very difficult to sample uniformly from version space V(z), as this set of points lives on a convex polyhedron on the unit sphere in(7) W_x. In the following two subsections we present two methods to achieve this sampling. The first method develops on an idea of Ruján (1997) (later followed up by a kernel version of the algorithm in Ruján and Marchand, 2000) that is based on the idea of playing billiards in version space V(z): after entering the version space with a very simple learning algorithm such as the kernel perceptron (see Algorithm 1), the classifier w is considered as a billiard ball and is bounced for a while within the convex polyhedron V(z). If this billiard is ergodic with respect to the uniform distribution over V(z), i.e. the travel time the billiard ball spends in a subset W' ⊆ V(z) is proportional to the volume ratio of W' and V(z), then averaging over the trajectory of the billiard ball leads, in the limit of an infinite number of bounces, to the centre of mass of version space. The second method presented tries to overcome the large computational demands of the billiard method by only approximately achieving a uniform sampling of version space. The idea is to use the perceptron

(7) Note that by Theorem 12 it suffices to sample from the projection of the version space onto W_x.

Figure 2: Plot of a version space (the convex polyhedron containing the black dot) V(z) in a 3-dimensional feature space K. Each hyperplane is defined by a training example via its normal vector y_i x_i.

learning algorithm in dual variables with different permutations Π : {1, …, m} → {1, …, m} so as to obtain different consistent classifiers w_i ∈ V(z) (see Watkin, 1993, for a similar idea). Obviously, the number of different samples obtained is finite and thus it is impossible to achieve exactness of the method in the limit of considering all permutations. Nevertheless, we shall demonstrate that, in particular for the task of handwritten digit recognition, the achieved performances are comparable to state-of-the-art learning algorithms. Finally, we would like to remark that recently other efficient methods to estimate the Bayes point directly have been presented (Rychetsky et al., 2000; Minka, 2001). The main idea in Rychetsky et al. (2000) is to work out all corners w_i of version space and average over them in order to approximate the centre of mass of version space. Note that there are exactly m corners, because the i-th corner w_i satisfies ⟨x_j, w_i⟩ = 0 for all j ≠ i and y_i⟨x_i, w_i⟩ > 0. If X = (x_1, …, x_m)′ is the m × N matrix of mapped training points (x_1, …, x_m) flipped to their correct side and we use the approach (7) for w, this simplifies to

    Xw_i = XX′α_i = Gα_i = (0, …, 0, y_i, 0, …, 0)′ =: y_i e_i,

where the r.h.s. is the i-th unit vector multiplied by y_i. As a consequence, the expansion coefficients α_i of the i-th corner w_i can easily be computed as α_i = y_i G⁻¹ e_i and then need to be normalised such that ‖w_i‖ = 1. The difficulty with this approach, however, is the fact that the inversion of the Gram matrix G is O(m³) and is thus as computationally complex as support vector learning, while not enjoying the anytime property of a sampling scheme. The algorithm presented in Minka (2001, Chapter 5) (also see Opper and Winther, 2000, for an equivalent method) uses the idea of approximating the posterior measure P_{W|Z=z} by a product of Gaussian
densities so that the centre of mass can be computed analytically. Although the approximation of the cut-off posterior P_{W|Z=z}, resulting from the delta-peaked likelihood given in Definition 6, by Gaussian measures seems very crude at first glance, Minka could show that his method compares favourably to the results presented in this paper.

3.1 Playing Billiards in Version Space

In this subsection we present the billiard method to estimate the Bayes point, i.e. the centre of mass of version space when assuming a PAC likelihood and a uniform prior P_W over weight vectors of unit length (the pseudo

Figure 3: Schematic view of the kernel billiard algorithm. Starting at b_0 ∈ V(z), a trajectory of billiard bounces b_1, …, b_5, … is calculated and then averaged over so as to obtain an estimate ŵ_cm of the centre of mass of version space.

code is given on page 275). By Theorem 2, each position b of the billiard ball and each estimate w_i of the centre of mass of V(z) can be expressed as linear combinations of the mapped input points, i.e.

    w = Σ_{i=1}^m α_i x_i,    b = Σ_{i=1}^m γ_i x_i,    α, γ ∈ R^m.

Without loss of generality we can make the following ansatz for the direction vector v of the billiard ball:

    v = Σ_{i=1}^m β_i x_i,    β ∈ R^m.

Using this notation, inner products and norms in feature space K become

    ⟨b, v⟩ = Σ_{i=1}^m Σ_{j=1}^m γ_i β_j k(x_i, x_j),    ‖b‖² = Σ_{i,j=1}^m γ_i γ_j k(x_i, x_j),    (20)

where k : X × X → R is a Mercer kernel and has to be chosen beforehand. At the beginning we assume that w_0 = 0, i.e. α = 0. Before generating a billiard trajectory in version space V(z) we first run any learning algorithm to find an initial starting point b_0 inside the version space (e.g. support vector learning or the kernel perceptron (see Algorithm 1)). Then the kernel billiard algorithm consists of three steps (see also Figure 3):

1. Determine the closest boundary in direction v_i starting from the current position b_i. Since it is computationally very demanding to calculate the flight time of the billiard ball on geodesics of the hyper-sphere W(x) (see also Neal, 1997), we make use of the fact that the shortest distance in Euclidean space (if it exists) is also the shortest distance on the hyper-sphere W(x). Thus, we have for the flight time τ_j of the billiard ball at position b_i in direction v_i to the hyperplane with normal vector y_j x_j:

    τ_j = −⟨b_i, x_j⟩ / ⟨v_i, x_j⟩.    (21)

After calculating all m flight times, we look for the smallest positive one, i.e.

    c = argmin_{j ∈ {i : τ_i > 0}} τ_j.
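Step 1 (and the reflection update of step 2) acts purely on the dual coefficients γ and β. A minimal Python sketch, assuming a precomputed Gram matrix `G`; the function names are illustrative, not from the paper:

```python
import numpy as np

def flight_times(G, gamma, beta):
    # tau_j = -<b, x_j> / <v, x_j>, where <b, x_j> = (G @ gamma)_j and
    # <v, x_j> = (G @ beta)_j in the dual representation
    with np.errstate(divide='ignore', invalid='ignore'):
        return -(G @ gamma) / (G @ beta)

def closest_boundary(G, gamma, beta, eps=1e-12):
    # index of the hyperplane hit first: the smallest strictly positive tau
    tau = flight_times(G, gamma, beta)
    positive = np.flatnonzero(tau > eps)
    c = positive[np.argmin(tau[positive])]
    return c, tau[c]

def bounce(G, gamma, beta, c, tau):
    # position update b' = b + tau * v, in dual coefficients
    gamma_new = gamma + tau * beta
    # reflection v' = v - 2 <v, x_c> / ||x_c||^2 * x_c: only the c-th
    # dual coefficient of the direction vector changes
    beta_new = beta.copy()
    beta_new[c] -= 2.0 * (G @ beta)[c] / G[c, c]
    # renormalise position and direction using ||w||^2 = alpha' G alpha
    gamma_new /= np.sqrt(gamma_new @ G @ gamma_new)
    beta_new /= np.sqrt(beta_new @ G @ beta_new)
    return gamma_new, beta_new
```

With an orthonormal toy Gram matrix the reflection reduces to mirroring the direction vector at the hit hyperplane, which is easy to verify by hand.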

Determining the closest bounding hyperplane in Euclidean space rather than on geodesics causes problems if the surface of the hyper-sphere W(x) is almost orthogonal to the direction vector v_i, in which case τ_c → 0. If this happens we randomly generate a direction vector v_i pointing towards the version space V(z). Assuming that the last bounce took place at the hyperplane having normal y_c x_c, this condition can easily be checked by

    y_c ⟨v_i, x_c⟩ > 0.    (22)

Note that since the samples are taken from the bouncing points, the above procedure of dealing with the curvature of the hyper-sphere does not constitute an approximation but is exact. An alternative method of dealing with the problem of the curvature of the hyper-sphere W can be found in Minka (2001, Section 5.8).

2. Update the billiard ball's position to b_{i+1} and the new direction vector to v_{i+1}. The new point b_{i+1} and the new direction v_{i+1} are calculated from

    b_{i+1} = b_i + τ_c v_i,    (23)
    v_{i+1} = v_i − 2 (⟨v_i, x_c⟩ / ‖x_c‖²) x_c.    (24)

Afterwards the position b_{i+1} and the direction vector v_{i+1} need to be normalised. This is easily achieved by equation (20).

3. Update the centre of mass w_i of the whole trajectory by the new line segment from b_i to b_{i+1} calculated on the hyper-sphere W(x). Since the solution w lies on the hyper-sphere W(x) (see Theorem 1), we cannot simply update the centre of mass using a weighted vector addition. Let us introduce the operation ⊕_μ acting on vectors of unit length. This function has to have the following properties:

    ‖s ⊕_μ t‖² = 1,
    ‖t − s ⊕_μ t‖ = μ ‖t − s‖,
    s ⊕_μ t = ρ₁(⟨s, t⟩, μ) s + ρ₂(⟨s, t⟩, μ) t,    ρ₁(⟨s, t⟩, μ) ≥ 0,  ρ₂(⟨s, t⟩, μ) ≥ 0.

This rather arcane definition implements a weighted addition of s and t such that μ is the fraction between the resulting chord length ‖t − s ⊕_μ t‖ and the total chord length ‖t − s‖. In Appendix A.4 it is shown that the following formulae for ρ₁(⟨s, t⟩, μ) and ρ₂(⟨s, t⟩, μ) implement such a weighted addition:

    ρ₁(⟨s, t⟩, μ) = μ √( (2 − μ²(1 − ⟨s, t⟩)) / (1 + ⟨s, t⟩) ),
    ρ₂(⟨s, t⟩, μ) = −ρ₁(⟨s, t⟩, μ) ⟨s, t⟩ + (1 − μ²(1 − ⟨s, t⟩)).

By assuming a constant line density on the
manifold V(z), the whole line between b_i and b_{i+1} can be represented by its midpoint m on the manifold V(z), given by

    m = (b_i + b_{i+1}) / ‖b_i + b_{i+1}‖.

Thus, one updates the centre of mass of the trajectory by

    w_{i+1} = ρ₁(⟨w_i, m⟩, Ξ_i/(Ξ_i + ξ_i)) w_i + ρ₂(⟨w_i, m⟩, Ξ_i/(Ξ_i + ξ_i)) m,

where ξ_i = ‖b_i − b_{i+1}‖ is the length of the trajectory in the i-th step and Ξ_i = Σ_{j=1}^i ξ_j is the accumulated length up to the i-th step. Note that the operation ⊕_μ is only an approximation to the addition operation we sought, because an exact weighting would require the arc lengths rather than the chord lengths. As a stopping criterion we suggest computing an upper bound on ρ₂, the weighting factor of the new part of the trajectory. If this value falls below a pre-specified threshold (TOL) we stop the algorithm. Note that the increase in Ξ_i will always lead to termination.

3.2 Large Scale Bayes Point Machines

Clearly, all we need for estimating the centre of mass of version space (2) is a set of unit length weight vectors w_i drawn uniformly from V(z). In order to save computational resources it might be advantageous to achieve a uniform sample only approximately. The classical perceptron learning algorithm offers the possibility to obtain up to m! different classifiers in version space simply by learning on different permutations of the training sample. Of course, due to the sparsity of the solution, the number of different classifiers obtained is usually considerably smaller. A classical theorem to be found in Novikoff (1962) guarantees the convergence of this procedure and furthermore provides an upper bound on the number t of mistakes needed until convergence. More precisely, if there exists a classifier w_SVM with margin γ_z(w_SVM) > 0 (see (3)), then the number of mistakes until convergence, which is an upper bound on the sparsity of the solution, is not more than ς²/γ_z²(w_SVM), where ς is the smallest real number such that ‖x_i‖_K ≤ ς. The quantity γ_z(w_SVM) is maximised for the solution w_SVM found by the support vector machine, and whenever the support vector machine is theoretically justified by results from learning theory (see Shawe-Taylor et al., 1998; Vapnik, 1998) the ratio ς²/γ_z²(w_SVM) is considerably less than m, say d. Algorithmically, we can benefit from this sparsity by the
following trick: since

    w = Σ_{i=1}^m α_i x_i,

all we need to store is the m-dimensional vector α. Furthermore, we keep track of the m-dimensional vector o of real-valued outputs

    o_i = ⟨x_i, w_t⟩ = Σ_{j=1}^m α_j k(x_i, x_j)

of the current solution at the i-th training point. By definition, in the beginning α = o = 0. Now, if o_i y_i ≤ 0 we update α_i by α_i + y_i and update o by o_j ← o_j + y_i k(x_i, x_j), which requires only m kernel calculations (the evaluation of the i-th row of the Gram matrix G). In summary, the memory requirement of this algorithm is 2m and the number of kernel calculations is not more than d·m. As a consequence, the computational requirement of this algorithm is no more than the computational requirement for the evaluation of the margin γ_z(w_SVM)! We suggest using this efficient perceptron learning algorithm in order to obtain samples w_i for the computation of the centre of mass (2). In order to investigate the usefulness of this approach experimentally, we compared the distribution of generalisation errors of samples obtained by perceptron learning on permuted training samples with samples obtained by a full Gibbs sampling (see Graepel and Herbrich, 2000, for details on the kernel Gibbs sampler). For computational reasons, we used only 88 training patterns and 453 test patterns of the classes 1 and 2 from the MNIST data set.⁸ In Figure 4 (a) and (b) we plotted the distribution over random samples using the kernel⁹

    k(x, x̃) = (⟨x, x̃⟩ + 1)⁵.    (25)

Using a quantile-quantile (QQ) plot technique we can compare both distributions in one graph (see Figure 4 (c)). These plots suggest that by simple permutation of the training sample we are able to obtain a sample of classifiers exhibiting a similar distribution of generalisation error to the one obtained by time-consuming Gibbs sampling.

⁸ This data set is publicly available at
⁹ We decided to use this kernel because it showed excellent generalisation performance when using the support vector machine.
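The cached-output perceptron described above can be sketched as follows. This is an illustrative implementation (assuming a training sample that is separable in feature space, since the perceptron only terminates when version space is non-empty); each mistake costs one row of the Gram matrix, i.e. m kernel evaluations:

```python
import numpy as np

def kernel_perceptron(G, y, rng):
    # Dual perceptron with cached outputs o_i = sum_j alpha_j k(x_i, x_j).
    # Memory is 2m (alpha and o); each mistake on point i adds y_i to
    # alpha_i and updates all cached outputs with row i of G at once.
    m = len(y)
    alpha = np.zeros(m)
    o = np.zeros(m)
    mistakes = True
    while mistakes:
        mistakes = False
        for i in rng.permutation(m):   # a fresh permutation of the sample
            if o[i] * y[i] <= 0:       # training point i is misclassified
                alpha[i] += y[i]
                o += y[i] * G[i]
                mistakes = True
    return alpha
```

Running it with different random permutations yields the different consistent classifiers w_i that are averaged into the centre-of-mass estimate.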

Figure 4: (a) Histogram of generalisation errors (estimated on a test set) using a kernel Gibbs sampler. (b) Histogram of generalisation errors (estimated on a test set) using a kernel perceptron. (c) QQ plot of the distributions (a) and (b). The straight line indicates that the two distributions only differ by an additive and a multiplicative constant, i.e. they exhibit the same rate of decay.

A very advantageous feature of this approach, as compared to support vector machines, is its adjustable time and memory requirements and the anytime availability of a solution due to sampling. If the training sample grows further and we are not able to spend more time learning, we can adjust the number of samples w used, at the cost of a slightly worse generalisation error (see also Section 4).

3.3 Extension to Training Error

To allow for training errors we recall that the version space conditions are given by

    ∀(x_i, y_i) ∈ z : y_i ⟨x_i, w⟩ = y_i Σ_{j=1}^m α_j k(x_i, x_j) > 0.    (26)

Now we introduce the following version space conditions in place of (26):

    ∀(x_i, y_i) ∈ z : y_i Σ_{j=1}^m α_j k(x_i, x_j) > −λ y_i α_i k(x_i, x_i),    (27)

where λ is an adjustable parameter related to the softness of version space boundaries. Clearly, considering this from the billiard viewpoint, equation (27) can be interpreted as allowing penetration of the walls, an idea already hinted at in Ruján (1997). Since the linear decision function is invariant under any positive rescaling of the expansion coefficients α, the factor α_i on the right hand side makes λ scale invariant as well. Although other ways of incorporating training errors are conceivable, our formulation allows for a simple modification of the algorithms described in the previous two subsections. To see this we note that equation (27) can be rewritten as

    ∀(x_i, y_i) ∈ z : y_i Σ_{j=1}^m α_j (1 + λ I_{i=j}) k(x_i, x_j) > 0.

Hence we can use the above algorithms but with an additive correction to the
diagonal terms of the Gram matrix. This additive correction to the kernel diagonals is similar to the quadratic margin loss used to introduce a soft margin during training of support vector machines (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000). Another insight into the introduction of soft boundaries comes from noting that the distance between two points x_i and x_j in feature space K can be written as

    ‖x_i − x_j‖² = ‖x_i‖² + ‖x_j‖² − 2⟨x_i, x_j⟩,

Figure 5: Parameter spaces for a two dimensional toy problem (λ = 0, 0.5, 1, 1.5, 2, 2.5) obtained by introducing training error via an additive correction to the diagonal term of the kernel matrix. In order to visualise the resulting parameter space we fixed m = 3 and normalised all axes by the product of eigenvalues λ₁λ₂λ₃. See text for further explanation.

which in the case of points of unit length in feature space becomes 2(1 + λ − k(x_i, x_j)). Thus, if we add λ to the diagonal elements of the Gram matrix, the points become equidistant for λ → ∞. This would give the resulting version space a more regular shape. As a consequence, the centre of the largest inscribable ball (the support vector machine solution) would tend towards the centre of mass of the whole of version space. We would like to recall that the effective parameter space of weight vectors considered is given by

    W(x) := { w = Σ_{i=1}^m α_i x_i | ‖w‖² = Σ_{i=1}^m Σ_{j=1}^m α_i α_j ⟨x_i, x_j⟩ = 1 }.

In terms of α this can be rewritten as

    { α ∈ R^m | α′Gα = 1 },    G_ij = ⟨x_i, x_j⟩ = k(x_i, x_j).

Let us represent the Gram matrix by its spectral decomposition, i.e. G = UΛU′, where U′U = I and Λ = diag(λ₁, …, λ_m) is the diagonal matrix of eigenvalues λ_i. Thus we know that the parameter space is the set of all coefficients α̃ = U′α which fulfil

    { α̃ ∈ R^m : α̃′Λα̃ = 1 }.

This is the defining equation of an m-dimensional axis-parallel ellipsoid. Now adding the term λ to the diagonal of G makes G a full rank matrix (see Micchelli, 1986). In Figure 5 we plotted the parameter space for a 2D toy problem using only m = 3 training points. Although the parameter space is 3-dimensional for all λ > 0, we obtain a pancake-like parameter space for small values of λ. For λ → ∞ the set of admissible coefficients α becomes the m-dimensional ball, i.e. the training examples become more and more orthogonal with increasing λ. The way we incorporated training errors corresponds to the choice of a new kernel given by

    k_λ(x, x̃) := k(x, x̃) + λ I_{x = x̃}.
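In matrix terms the soft-boundary modification is just a diagonal shift of the Gram matrix, which shifts every eigenvalue by λ and hence guarantees full rank. A one-line sketch (`soften` is an illustrative name, not from the paper):

```python
import numpy as np

def soften(G, lam):
    # soft-boundary correction k_lambda(x, z) = k(x, z) + lambda * [x == z]:
    # on the training sample this only adds lambda to the diagonal of G
    return G + lam * np.eye(len(G))
```

Because eig(G + λI) = eig(G) + λ, even a rank-deficient Gram matrix becomes full rank for any λ > 0, which is exactly the regularising effect on the shape of the parameter space described above.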

Figure 6: Version spaces V(z) for two 3-dimensional toy problems. (Left) One can see that the approximation of the Bayes point (diamond) by the centre of the largest inscribable ball (cross) is reasonable if the version space is regularly shaped. (Right) The situation changes in the case of an elongated and asymmetric version space V(z).

Finally, note that this modification of the kernel has no effect on new test points x ∉ x that are not elements of the training sample x. For an explanation of the effect of λ in the context of Gaussian processes see Opper and Winther (2000).

4 Experimental Results

In this section we present experimental results both on University of California, Irvine (UCI) benchmark datasets and on two bigger tasks of handwritten digit recognition, namely the US Postal Service (USPS) and the Modified National Institute of Standards (MNIST) digit recognition tasks. We compared our results to the performance of a support vector machine using the reported test set performance from Rätsch et al. (2000) (UCI), Schölkopf (1997, p. 57) (USPS) and Cortes (1995) (MNIST). All the experiments were done using Algorithm 2 in Appendix B.

4.1 Artificial Data

For illustration purposes we set up a toy dataset of training and test points in R³. The data points were uniformly generated in [0, 1]³ and labelled by a randomly generated linear decision rule using the kernel k(x, x̃) = ⟨x, x̃⟩. In Figure 6 we illustrate the potential benefits of a Bayes point machine over a support vector machine for elongated version spaces. By using the billiard algorithm to estimate the Bayes point (see Subsection 3.1), we were able to track all positions b_i where the billiard ball hits a version space boundary. This allows us to easily visualise the version spaces V(z). For the example illustrated in Figure 6 (right) the support vector machine and Bayes point solutions with hard margins/boundaries are far apart, resulting in a noticeable reduction in generalisation error of the Bayes point machine (0.8%) compared to the support
vector machine (5.0%) solution, whereas for regularly shaped version spaces (Figure 6 (left)) the difference is negligible (0.6% to 0.6%).

Figure 7: Decision functions for a 2D toy problem of a support vector machine (SVM) (left) and a Bayes point machine (BPM) (right) using hard margins (λ = 0) and RBF kernels with σ = 1. Note that the Bayes point machine results in a much flatter function, sacrificing margin (γ_z(w_SVM) = 0.36 vs. γ_z(w_cm) = 0.2) for smoothness.

In a second illustrative example we compared the smoothness of the resulting decision function when using kernels both with support vector machines and with Bayes point machines. In order to model a non-linear decision surface we used the radial basis function (RBF) kernel

    k(x, x̃) = exp(−‖x − x̃‖² / (2σ²)).    (28)

Figure 7 shows the resulting decision functions in the hard margin/boundary case. Clearly, the Bayes point machine solution appears much smoother than the support vector machine solution, although its geometrical margin of 0.2 is significantly smaller. The above examples should only be considered as aids to enhance the understanding of the properties of the Bayes point machine algorithm, rather than strict arguments about general superiority.

4.2 UCI Benchmark Datasets

To investigate the performance on real world datasets we compared hard margin support vector machines to Bayes point machines with hard boundaries (λ = 0) when using the kernel billiard algorithm described in Subsection 3.1. We studied the performance on 5 standard benchmarking datasets from the UCI Repository, and on banana and waveform, two toy datasets (see Rätsch et al., 2000). In each case the data was randomly partitioned into training and test sets in the ratio 60%:40%. The means and standard deviations of the average generalisation errors on the test sets are presented as percentages in the columns headed "SVM (hard margin)" and "BPM (λ = 0)" in Table 1. As can be seen from the results, the Bayes point machine outperforms support vector machines on almost all datasets at a statistically significant level. Note, however, that the result of the t-test is strictly valid only under the assumption that training and test data
were independent, an assumption which may be violated by the procedure of splitting the one data set into different pairs of training and test sets (Dietterich, 1998). Thus, the resulting p values should serve only as an indication of the significance of the result. In order to demonstrate the effect of positive λ (soft boundaries) we trained a Bayes point machine with soft boundaries and compared it to training a support vector machine with soft margin using the same Gram

    Dataset      SVM (hard margin)   BPM (hard boundary)   p-value
    Heart        254±4               228±34
    Thyroid      53±24               44±2                  3
    Diabetes     33±24               32±25                 5
    Waveform     3±                  2±9                   2
    Banana       62±5                5±4                   5
    Sonar        54±37               59±38
    Ionosphere   9±25                5±

Table 1: Experimental results on seven benchmark datasets. We used the RBF kernel given in (28) with values of σ found optimal for SVMs. Shown is the estimated generalisation error in percent. The standard deviation was obtained over the different runs. The final column gives the p values of a paired t-test for the hypothesis "BPM is better than SVM", indicating that the improvement is statistically significant.

matrix (see equation (27)). It can be shown that such a support vector machine corresponds to a soft margin support vector machine where the margin slacks are penalised quadratically (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000; Herbrich, 2000). In Figure 8 we have plotted the generalisation error as a function of λ for the toy problem from Figure 6 and for the dataset heart, using the same setup as in the previous experiment. We observe that the support vector machine with an ℓ₂ soft margin achieves a minimum of the generalisation error which is close to, or just above, the minimum error which can be achieved using a Bayes point machine with positive λ. This may not be too surprising taking the change of geometry into account (see Section 3.3). Thus, the soft margin support vector machine also approximates the Bayes point machine with soft boundaries. Finally, we would like to remark that the running time of the kernel billiard was not much different from the running time of our support vector machine implementation. We did not use any chunking or decomposition algorithms (see, e.g. Osuna et al., 1997; Joachims, 1999; Platt, 1999), which in the case of support vector machines would have decreased the running time by orders of magnitude. The most noticeable difference in running time was with the waveform and banana datasets, where we are given m = 400 observations. This can be explained by the fact that the computational effort of the kernel
billiard method is O(B·m²), where B is the number of bounces. As we set our tolerance criterion TOL for stopping very low (10⁻⁴), the number B of bounces required for these datasets was large. Hence, in contrast to the computational effort of the support vector machine of O(m³), the number B of bounces led to a much higher computational demand when using the kernel billiard.

4.3 Handwritten Digit Recognition

For the two tasks we now consider, our inputs are n × n grey value images which were transformed into n²-dimensional vectors by concatenation of the rows. The grey values were taken from the set {0, …, 255}. All images were labelled by one of the ten classes 0 to 9. For each of the ten classes y ∈ {0, …, 9} we ran the perceptron algorithm L times, each time labelling all training points of class y by +1 and the remaining training points by −1. On a Pentium III 500 MHz with 128 MB memory, each learning trial took 2 minutes (MNIST) or 2 minutes (USPS), respectively. For the classification of a test image x

Note, however, that we made use of the fact that 40% of the grey values of each image are 0, since they encode background. Therefore, we encoded each image as an index-value list, which allows much faster computation of the inner products ⟨x, x̃⟩ and speeds up the algorithm by a factor of

Figure 8: Comparison of the soft boundary Bayes point machine with the soft margin support vector machine. Plotted is the generalisation error versus λ for a toy problem using linear kernels (left) and for the heart dataset using RBF kernels with σ = 3 (right). The error bars indicate one standard deviation of the estimated mean.

we calculated the real-valued output of all the different classifiers¹² by

    f_i(x) = ⟨x, w_i⟩ / (‖w_i‖ ‖x‖)
           = Σ_{j=1}^m (α_i)_j k(x_j, x) / ( √(Σ_{r=1}^m Σ_{s=1}^m (α_i)_r (α_i)_s k(x_r, x_s)) · √(k(x, x)) ),

where we used the kernel k given by (25). Here, (α_i)_j refers to the expansion coefficient corresponding to the i-th classifier and the j-th data point. Now, for each of the ten classes we calculated the real-valued decision of the Bayes point estimate ŵ_cm,y by¹³

    f_bp,y(x) = ⟨x, ŵ_cm,y⟩ = (1/L) Σ_{i=1}^L ⟨x, w_{i+yL}⟩.

In a Bayesian spirit, the final decision was carried out by

    h_bp(x) := argmax_{y ∈ {0,…,9}} f_bp,y(x).

Note that f_bp,y(x) can be interpreted as an (unnormalised) approximation of the posterior probability that x is of class y when restricted to the function class () (see Platt, 2000). In order to test the dependence of the generalisation error on the magnitude max_y f_bp,y(x), we fixed a certain rejection rate r ∈ [0, 1] and rejected the fraction r of test points with the smallest value of max_y f_bp,y(x).

MNIST Handwritten Digits. In the first of our large scale experiments we used the full MNIST dataset with 60,000 training examples and 10,000 test examples of grey value images of handwritten digits. The plot resulting from learning only L consistent classifiers per class and rejection based on the real-valued output of the single Bayes points is depicted in Figure 9 (left). As can be seen from this plot, even without rejection the Bayes point has excellent generalisation performance when compared to support vector machines, which achieve a generalisation error of 1.4%.¹⁴ Furthermore, rejection based on the real-valued

¹² For notational simplicity we assume that the
first L classifiers are classifiers for the class 0, the next L for class 1, and so on.
¹³ Note that in this subsection y ranges over {0, …, 9}.
¹⁴ The result of 1.1% with the kernel (25) and a polynomial degree of four could not be reproduced and is thus considered invalid (personal communication with P. Haffner). Note also that the best results with support vector machines were obtained when using a soft margin.

Figure 9: Generalisation error as a function of the rejection rate for the MNIST and USPS data sets. (Left) For MNIST, the support vector machine achieved 1.4% without rejection as compared to 1.46% for the Bayes point machine. Note that by rejection based on the real-valued output the generalisation error could be reduced to 0.1%, indicating that this measure is related to the probability of misclassification of single test points. (Right) On USPS, the support vector machine achieved 4.6% without rejection as compared to 4.73% for the Bayes point machine.

output f_bp(x) turns out to be excellent, thus reducing the generalisation error to 0.1%. One should also bear in mind that the learning time for this simple algorithm was comparable to that of support vector machines, which need 8 hours per digit¹⁵ (see Platt, 1999).

USPS Handwritten Digits. In the second of our large scale experiments we used the USPS dataset with 7291 training examples and 2007 test examples of 16 × 16 grey value images of handwritten digits. The resulting plot of the generalisation error when rejecting test examples based on the real-valued outputs of the single Bayes points is shown in Figure 9 (right). Again, the resulting classifier has a generalisation error performance comparable to support vector machines, whose best results are 4.5% when using a soft margin and 4.6% in the hard margin scenario. In Figure 10 we plotted the 25 most commonly used images x_i ∈ x with non-zero coefficients (α_j)_i across the different classifiers learned. Though no margin maximisation was performed, it turns out that, in accordance with the support vector philosophy, these are the hard patterns in the dataset with respect to classification. Moreover, as can be seen from the 1st, 6th and 8th examples, there is clearly noise in the dataset which could potentially be taken into account using the techniques outlined in Subsection 3.3 at no extra computational cost.

5
Discussion and Conclusion

In this paper we presented two estimation methods for the Bayes point for linear classifiers in feature spaces. We showed how the support vector machine can be viewed as a (spherical) approximation method to the Bayes point hyperplane. By randomly generating consistent hyperplanes (playing billiards in version space) we showed how to stochastically approximate this point. In the field of Markov chain Monte Carlo methods such approaches are known as reflective slice sampling (Neal, 1997). Current investigations in this field include the question of the ergodicity of such methods. The second method of estimating the Bayes point consists of running the perceptron algorithm with several permutations of the training sample in order to average over the samples thereby obtained. By its inherent simplicity it is much more amenable to large scale problems and in particular compares favourably to state-of-the-art methods such as support vector learning.

¹⁵ Recently, DeCoste and Schölkopf (2002) demonstrated that an efficient implementation of the support vector machine reduces the amount of learning time to 1 hour per digit.

Figure 10: Shown are the 25 most commonly used examples x_i ∈ x (non-zero coefficients (α_j)_i for many of the classifiers j) from the USPS dataset across the different classifiers learned using the perceptron learning algorithm. The two numbers below each digit give the number of classifiers it appeared in and the true class y ∈ {0, …, 9} in the training sample. Interestingly, in accordance with the philosophy behind support vectors, these are the hardest patterns with respect to the classification task, although no explicit margin maximisation was performed.

The centre of mass approach may also be viewed as a multidimensional extension of the Pitman estimator (Pitman, 1939) if the weight vector w is thought of as a location parameter to be estimated from the data. Unfortunately, neither the centre of mass of version space nor the support vector solution are invariant under general linear transformations of the data, but only under the class of orthogonal transformations (see, e.g. Schölkopf, 1997). For the centre of mass this is due to the normalisation of the weight vector. Note that it is this normalisation that makes it conceptually hard to incorporate a bias dimension into our framework. We presented a derivation of the Bayes point as the optimal projection of the Bayes classification strategy. This strategy is known to be the optimal strategy, i.e. the classification strategy which results in classifications with the smallest generalisation error, when considering the generalisation error on average over the random draw of the target hypothesis according to the prior P_H. It is worthwhile to mention, however, that recent results in the PAC community allow one to obtain performance guarantees for the Bayesian classification strategy even for single target hypotheses h ~ P_H which hold for most random draws of the training sample used (see McAllester, 1998, 1999). The results indicate that the fraction of the volume of parameter space to the volume of version space plays a crucial role in the generalisation error of Bayesian classifiers. It could be shown elsewhere (Herbrich et al., 1999b) that these bounds can be extended to single classifiers and then involve the volume of the largest point-symmetric body around the classifier fully contained in version space (see Figure 2). These results may additionally motivate the centre of mass as a classifier with a good volume ratio and thus good generalisation. The results also indicate that under circumstances where the shape of the version space is almost spherical, the classical support vector machine gives the best
result (see, e.g. Herbrich and Graepel, 2000). In a series of experiments it has been shown that the Bayes point, i.e. the centre of mass of version space, has excellent generalisation performance even when only broadly approximated by the average classifier found with simple perceptrons. Furthermore, it was demonstrated that the real-valued output of the Bayes point on new test points serves as a reliable confidence measure on its prediction. An interesting feature of the Bayes point seems to be that the hardest patterns in the training sample tend to have the largest contribution in the final expansion, too. This is in accordance with the support vector philosophy, although the Bayes point machine algorithm does not perform any kind of margin maximisation explicitly. Bayes points in feature space constitute an interesting bridge between the Bayesian approach to machine learning and statistical learning theory. In this paper we have shown that they outperform hard margin support vector machines. It is well known that the introduction of a soft margin improves the generalisation performance of support vector machines on most datasets by allowing for training errors. Consequently, we introduced a mechanism for Bayesian learning with training errors admitted. A comparison of the generalisation performance of the two types of systems shows that they exhibit a much closer generalisation performance than in the hard boundary/margin case. Although it is generally believed that sparsity in terms of the expansion coefficients α is an indicator of good generalisation (see, e.g. Littlestone and Warmuth, 1986; Herbrich et al., 2000b), the algorithms presented show that dense classifiers also exhibit good generalisation performance. An interesting question arising from our observation is therefore: which properties of single classifiers in version space are responsible for good generalisation?
Acknowledgements

This work was partially done during a research stay of Ralf Herbrich at the University of Bristol and Royal Holloway, University of London in 1998. He would like to thank Colin Campbell and John Shawe-Taylor for the excellent research environment and also for the warm hospitality during that stay. We are also greatly indebted to Matthias Burger, Søren Fiig Jarner, Ulrich Kockelkorn, Klaus Obermayer, Manfred Opper, Craig Saunders, Peter Bollmann-Sdorra, Matthias Seeger, John Shawe-Taylor, Alex Smola, Jason Weston, Bob Williamson and Hugo Zaragoza for fruitful and inspiring discussions. Special thanks go to Patrick Haffner for pointing out the speed improvement obtained by exploiting the sparsity of the MNIST and USPS images, and to Jun Liao for pointing out a mistake in Algorithm 2. Finally we would like to thank Chris Williams and the three anonymous reviewers for their comments and suggestions that greatly improved the manuscript.

HERBRICH, GRAEPEL, CAMPBELL

Figure 11: The fraction of points on the circle which are differently classified by w and v is depicted by the solid black arc. Note that this fraction is in general given by 2α/(2π) = α/π = arccos(⟨w,v⟩)/π.

Appendix A. Proofs

A.1 Convergence of the Centre of Mass to the Bayes Point

In this section we present the proof of Theorem 10. We start with a simple lemma.

Lemma 13 (Generalisation Error for Spherical Distributions in Feature Space) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ₂^N resulting in {x := φ(x) | x ∈ X}. Furthermore let us assume that P_X is governed by (1). Then, for all w with ‖w‖ = 1 and all v with ‖v‖ = 1, it holds true that

    E_X[ℓ(sign(⟨X,w⟩), sign(⟨X,v⟩))] = arccos(⟨w,v⟩)/π.

Proof For a fixed value r ∈ ℝ⁺ let us consider all x such that ‖x‖² = r². Given w ∈ K and v ∈ K we consider the projection P_{w,v} : K → K onto the linear space spanned by w and v, and its complement P⊥_{w,v}, i.e. ∀x ∈ K : x = P_{w,v}(x) + P⊥_{w,v}(x). Then for w (and likewise for v) it holds true that

    sign(⟨x,w⟩) = sign(⟨P_{w,v}(x) + P⊥_{w,v}(x), w⟩)
                = sign(⟨P_{w,v}(x), w⟩ + ⟨P⊥_{w,v}(x), w⟩)
                = sign(⟨P_{w,v}(x), w⟩).

Hence for any value of r ∈ ℝ⁺ the notion of Figure 11 applies and gives

    ∀r ∈ ℝ⁺ : E_{X | ‖X‖=r}[ℓ(sign(⟨X,w⟩), sign(⟨X,v⟩))] = arccos(⟨w,v⟩)/π.
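The identity in the lemma above lends itself to a quick Monte Carlo check: for spherically symmetric inputs, the probability that two unit-norm linear classifiers disagree equals the angle between them divided by π. The sketch below is illustrative only; the dimension, sample size and seed are arbitrary choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary unit vectors playing the roles of w and v.
w = rng.normal(size=5)
w /= np.linalg.norm(w)
v = rng.normal(size=5)
v /= np.linalg.norm(v)

# Spherically symmetric inputs: a standard Gaussian has a uniformly
# distributed direction, which is all that matters for sign(<X, w>).
X = rng.normal(size=(200_000, 5))

disagree = np.mean(np.sign(X @ w) != np.sign(X @ v))
predicted = np.arccos(np.clip(w @ v, -1.0, 1.0)) / np.pi

assert abs(disagree - predicted) < 0.01
```

The check uses only the direction of X, so any spherically symmetric input distribution would serve equally well.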

Thus integrating over r results in

    E_X[ℓ(sign(⟨X,w⟩), sign(⟨X,v⟩))] = ∫₀^{+∞} (2/√π) exp(−r²) · (arccos(⟨w,v⟩)/π) dr = arccos(⟨w,v⟩)/π.

According to Definition 9 and the previous lemma, in order to find the Bayes point w_bp for a given training sample z we need to find the vector v which minimises the following function

    E_X[E_{W|Z=z}[ℓ(sign(⟨X,v⟩), sign(⟨X,W⟩))]] = E_{W|Z=z}[E_X[ℓ(sign(⟨X,v⟩), sign(⟨X,W⟩))]]
                                                = E_{W|Z=z}[arccos(⟨v,W⟩)/π]

subject to the constraint ‖v‖ = 1. Hence we have to determine the saddle point of the following Lagrangian

    L_exact(v,α) = E_{W|Z=z}[arccos(⟨v,W⟩)/π] + α(⟨v,v⟩ − 1)

w.r.t. v and α. The difficulty with this expression, however, is that by

    ∇_v L_exact(v,α)|_{v_bp} = −(1/π)·E_{W|Z=z}[W/√(1 − ⟨v_bp,W⟩²)] + 2α·v_bp = 0,
    ⟺ 2α·v_bp = (1/π)·E_{W|Z=z}[W/√(1 − ⟨v_bp,W⟩²)],

the resulting fixed-point equations for all components v_i are coupled because of the (1 − ⟨v,W⟩²)^{−1/2} term within the expectation, thus involving all components. Nevertheless, we can find a good proxy for arccos(⟨v,w⟩)/π in (1 − ⟨v,w⟩)/2 (see Figure 12). This is made more precise in the following lemma.

Lemma 14 (Quality of the Euclidean Distance Proxy) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ₂^N resulting in {x := φ(x) | x ∈ X}. Furthermore let us assume that P_X is governed by (1). Given a fixed vector v ∈ K of unit length, i.e. ‖v‖ = 1, let us finally assume that

    min_{w : P_{W|Z=z}(w) > 0} ⟨v,w⟩ > ε.  (29)

Then we know that

    E_{W|Z=z}[E_X[ℓ(sign(⟨X,W⟩), sign(⟨X,v⟩))]] ≤ E_{W|Z=z}[‖W − v‖²/4] + κ(ε),
    E_{W|Z=z}[E_X[ℓ(sign(⟨X,W⟩), sign(⟨X,v⟩))]] ≥ E_{W|Z=z}[‖W − v‖²/4] − κ(ε),

where

    κ(ε) := arccos(ε)/π − (1 − ε)/2   if ε ≥ √(1 − 4π⁻²),
    κ(ε) := arccos(x*)/π − (1 − x*)/2 ≈ 0.105   otherwise,

with x* := √(1 − 4π⁻²) ≈ 0.77.
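The quality of the proxy can be confirmed numerically: on a fine grid the gap between arccos(x)/π and (1 − x)/2 peaks at x = ±√(1 − 4π⁻²) and stays below 0.11. This sketch is illustrative only; the grid resolution is an arbitrary choice:

```python
import numpy as np

# Gap between the generalisation error arccos(x)/pi and its
# Euclidean proxy (1 - x)/2, evaluated on a fine grid of (-1, 1).
x = np.linspace(-0.9999, 0.9999, 2_000_001)
gap = np.arccos(x) / np.pi - (1.0 - x) / 2.0

# The extremum sits at x* = sqrt(1 - 4/pi^2) and never exceeds ~0.105.
x_star = np.sqrt(1.0 - 4.0 / np.pi**2)
assert abs(x[np.argmax(gap)] - x_star) < 1e-3
assert 0.10 < np.max(np.abs(gap)) < 0.11
```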

Figure 12: Plot of the functions arccos(x)/π and (1 − x)/2 versus x. As we can see, translating the latter function by not more than ≈ 0.11 shows that it is both an upper and a lower bound for arccos(x)/π and thus a reasonable proxy.

Proof Using Lemma 13 we only need to show that under the assumption (29) it holds true that

    ‖v − w‖²/4 − κ(ε) ≤ arccos(⟨v,w⟩)/π ≤ ‖v − w‖²/4 + κ(ε).

At first we notice that

    ‖v − w‖²/4 = (‖v‖² − 2⟨v,w⟩ + ‖w‖²)/4 = (1 − ⟨v,w⟩)/2.

Thus let us determine the maximal difference of f(x) = arccos(x)/π − (1 − x)/2 in the interval (−1, 1). A straightforward calculation reveals that the maximum occurs at x* = √(1 − 4π⁻²) and that 0.10 < f(x*) < 0.11. Hence whenever ε < √(1 − 4π⁻²) we can directly use the worst-case bound f(x) ≤ f(x*) < 0.11. Noticing that ∀x ∈ (−1, 1) : f(x) = −f(−x) proves the lemma.

If we replace arccos(⟨v,w⟩)/π by ‖w − v‖²/4 on the basis of the previous lemma, we obtain the simpler problem of determining the saddle point of the Lagrangian

    L_approx(v,α) = E_{W|Z=z}[‖W − v‖²/4] + α(⟨v,v⟩ − 1).

Taking the derivative w.r.t. v thus yields

    ∇_v L_approx(v,α)|_{v_c} = −E_{W|Z=z}[W/2] + 2α·v_c = 0,
    ⟺ 2α·v_c = E_{W|Z=z}[W/2].  (30)

The value of α is determined by multiplying the whole expression by v_c and utilising the constraint ⟨v_c,v_c⟩ = 1, i.e.

    2α·⟨v_c,v_c⟩ = 2α = E_{W|Z=z}[⟨W,v_c⟩/2].

Resubstituting this expression into (30) finally yields

    α = (1/4)·E_{W|Z=z}[⟨W,v_c⟩]  ⟹  v_c = (1/(4α))·E_{W|Z=z}[W] = E_{W|Z=z}[W]/⟨E_{W|Z=z}[W], v_c⟩,

whose only solution of unit length is given by

    v_c = E_{W|Z=z}[W]/‖E_{W|Z=z}[W]‖.

A.2 Proof of Theorem 11

Proof Given z = (x,y) ∈ Z^m we consider the projection P_x : W → K that maps all vectors of unit length onto the linear span of the points {φ(x₁),…,φ(x_m)} = {x₁,…,x_m}, and the projection P⊥_x : W → K that maps into the complement of that linear span. Thus, for any vector w ∈ K we have w = P_x(w) + P⊥_x(w), which immediately implies that for all (x_i,y_i) ∈ z

    ⟨x_i,w⟩ = ⟨x_i, P_x(w) + P⊥_x(w)⟩ = ⟨x_i, P_x(w)⟩ + ⟨x_i, P⊥_x(w)⟩ = ⟨x_i, P_x(w)⟩.

Suppose w_z is a minimiser of (6) but P_x(w_z) ≠ w_z, i.e. ‖P_x(w_z)‖ < ‖w_z‖ = 1. Then

    c(x, y, ⟨x₁,w_z⟩, …, ⟨x_m,w_z⟩) = c(x, y, ⟨x₁,P_x(w_z)⟩, …, ⟨x_m,P_x(w_z)⟩)
    > c(x, y, ‖P_x(w_z)‖⁻¹·⟨x₁,P_x(w_z)⟩, …, ‖P_x(w_z)‖⁻¹·⟨x_m,P_x(w_z)⟩)
    = c(x, y, ⟨x₁, P_x(w_z)/‖P_x(w_z)‖⟩, …, ⟨x_m, P_x(w_z)/‖P_x(w_z)‖⟩),

where the second line follows from the assumption that c is strictly monotonically decreasing in the third argument. We see that w_z cannot be the minimiser of (6), and by contradiction it follows that the minimiser must admit the representation (7).

A.3 Sufficiency of the Linear Span (Proof of Theorem 12)

Proof Let us rewrite the l.h.s. of (8):

    ∫_W ‖w − v‖² dP_{W|Z=z}(v) = 2 − 2·∫_W ⟨w,v⟩ dP_{W|Z=z}(v)
    = 2 − 2·Σ_{b∈{−1,+1}^m} ∫_W (Π_{i=1}^m I_{sign(y_i⟨x_i,v⟩)=b_i}) ⟨w,v⟩ dP_{W|Z=z}(v)
    = 2 − 2C·Σ_{b∈{−1,+1}^m} A(w,z,b),  (31)

where

    A(w,z,b) := ∫_W (Π_{i=1}^m I_{sign(y_i⟨x_i,v⟩)=b_i}) ⟨w,v⟩ (Π_{i=1}^m f(b_i)) dv,

and where the first line follows from the assumption that ‖w‖ = ‖v‖ = 1 for all v ∈ W, the second line is true because exactly one summand b leads to a non-zero product, and the third line follows from

    dP_{W|Z=z}(v) = (Π_{i=1}^m f(sign(y_i⟨x_i,v⟩)) / ∫_W Π_{i=1}^m f(sign(y_i⟨x_i,u⟩)) du) dv = C·(Π_{i=1}^m f(b_i)) dv.

Let us consider the expression A(w,z,b). For a fixed training sample z = (x,y) ∈ Z^m let P_x : W → K and P⊥_x : W → K be the projections of unit length vectors onto L_x and onto its orthogonal complement, respectively. Then, by construction we know that

    y_i⟨x_i,v⟩ = y_i⟨x_i, P_x(v) + P⊥_x(v)⟩ = y_i⟨x_i,P_x(v)⟩ + y_i⟨x_i,P⊥_x(v)⟩ = y_i⟨x_i,P_x(v)⟩.

As a consequence,

    sign(y_i⟨x_i, P_x(v) + P⊥_x(v)⟩) = b_i  ⟺  sign(y_i⟨x_i, P_x(v)⟩) = b_i  for all P⊥_x(v),  (32)

which implies that

    A(w,z,b) = ∫_W (Π_{i=1}^m I_{sign(y_i⟨x_i,P_x(v)⟩)=b_i}) ⟨w, P_x(v)⟩ (Π_{i=1}^m f(b_i)) dv,  (33)

because by (32) all the inner products with orthogonal components vanish. Noticing that ∀v ∈ W : ‖P_x(v)‖ ≤ 1 we can rewrite (33) as

    A(w,z,b) = ∫_{K∖L_x} ∫_{L_x} I_{‖v+u‖=1} (Π_{i=1}^m I_{sign(y_i⟨x_i,u⟩)=b_i}) ⟨w,u⟩ (Π_{i=1}^m f(b_i)) du dv
    = ∫_0^1 (∫_{K∖L_x} I_{‖v‖²=1−r²} dv) ∫_{L_x} I_{‖u‖=r} (Π_{i=1}^m I_{sign(y_i⟨x_i,u⟩)=b_i}) ⟨w,u⟩ (Π_{i=1}^m f(b_i)) du dr.

Lastly, for all u with ‖u‖ = r we use the fact that

    (Π_{i=1}^m I_{sign(y_i⟨x_i,u⟩)=b_i})·⟨w,u⟩ = r·(Π_{i=1}^m I_{sign(r·y_i⟨x_i,u/r⟩)=b_i})·⟨w,u/r⟩ = r·(Π_{i=1}^m I_{sign(y_i⟨x_i,u/r⟩)=b_i})·⟨w,u/r⟩,  (34)

that is, the feasibility of a point u does not change under rescaling of u. Combining (34), (33) and (31) we have shown that there exists a constant C̃ ∈ ℝ⁺ such that

    ∫_W ‖w − v‖² dP_{W|Z=z}(v) = C̃·∫_{W∩L_x} ‖w − v‖² dP_{W|Z=z}(v).

A.4 A Derivation of the Operation ⊕_µ

Let us derive the operation ⊕_µ acting on vectors of unit length. This function has to have the following properties (see Section 3):

    ‖s ⊕_µ t‖² = 1,  (35)
    ‖t − s ⊕_µ t‖ = µ·‖t − s‖,  (36)
    s ⊕_µ t = ρ₁·s + ρ₂·t,  (37)
    ρ₁, ρ₂ ≥ 0.  (38)

Here we assume that ‖s‖² = ‖t‖² = 1. Inserting equation (37) into (35) results in

    ‖ρ₁s + ρ₂t‖² = ⟨ρ₁s + ρ₂t, ρ₁s + ρ₂t⟩ = ρ₁² + ρ₂² + 2ρ₁ρ₂⟨s,t⟩ = 1.  (39)

In a similar fashion, combining equations (37) and (36) gives

    ‖t − s ⊕_µ t‖² = µ²·‖t − s‖²
    ‖(1 − ρ₂)t − ρ₁s‖² = µ²·‖t − s‖²
    (1 − ρ₂)² − 2(1 − ρ₂)ρ₁⟨s,t⟩ + ρ₁² = 2µ²(1 − ⟨s,t⟩).  (40)

Note that equation (39) is quadratic in ρ₂ and has the following solution

    ρ₂ = −ρ₁⟨s,t⟩ ± A,  where A := √(ρ₁²⟨s,t⟩² − ρ₁² + 1).  (41)

Let us insert equation (41) into the l.h.s. of equation (40). This gives the following quadratic equation in ρ₁:

    (1 − ρ₂)² − 2(1 − ρ₂)ρ₁⟨s,t⟩ + ρ₁² = (1 + ρ₁⟨s,t⟩ − A)(1 − A − ρ₁⟨s,t⟩) + ρ₁²
                                       = (1 − A)² − (ρ₁⟨s,t⟩)² + ρ₁²
                                       = 2 − 2A
                                       = 2µ²(1 − ⟨s,t⟩).

Solving this equation for ρ₁ results in

    ρ₁ = µ·√((µ²⟨s,t⟩ − µ² + 2)/(⟨s,t⟩ + 1)).

Inserting this formula back into equation (41) we obtain

    ρ₂ = −ρ₁⟨s,t⟩ ± (1 − µ²(1 − ⟨s,t⟩)).
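The closed forms for ρ₁ and ρ₂ can be verified numerically against properties (35) and (36). The sketch below is our own illustration (the function name `mu_combine` and the test vectors are not from the paper); it takes the '+' branch for ρ₂:

```python
import numpy as np

def mu_combine(s, t, mu):
    """Compute s (+)_mu t for unit vectors s, t and 0 < mu <= 1."""
    st = float(s @ t)
    rho1 = mu * np.sqrt((mu**2 * st - mu**2 + 2.0) / (st + 1.0))
    rho2 = -rho1 * st + (1.0 - mu**2 * (1.0 - st))   # '+' branch of (41)
    return rho1 * s + rho2 * t

rng = np.random.default_rng(2)
s = rng.normal(size=4)
s /= np.linalg.norm(s)
t = rng.normal(size=4)
t /= np.linalg.norm(t)

for mu in (0.25, 0.5, 0.9):
    u = mu_combine(s, t, mu)
    # property (35): the result has unit length
    assert abs(np.linalg.norm(u) - 1.0) < 1e-9
    # property (36): its distance to t is mu * ||t - s||
    assert abs(np.linalg.norm(t - u) - mu * np.linalg.norm(t - s)) < 1e-9
```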

Appendix B. Algorithms

Algorithm 1 Dual perceptron algorithm with permutation
Require: A permutation Π : {1,…,m} → {1,…,m}
Ensure: Existence of a version space V(z), i.e. a linearly separable training sample in feature space
  α = 0; o = 0
  repeat
    for i = 1,…,m do
      if y_Π(i)·o_Π(i) ≤ 0 then
        α_Π(i) ← α_Π(i) + y_Π(i)
        for j = 1,…,m do
          o_Π(j) ← o_Π(j) + y_Π(i)·k(x_Π(i), x_Π(j))
        end for
      end if
    end for
  until the if branch was never entered within the for loop
  return the expansion coefficients α
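Algorithm 1 translates almost line for line into Python. The rendering below is an illustrative sketch, not the authors' code; the toy data and the averaging over ten random permutations (a crude approximation to the centre of mass, in the spirit of the discussion above) are our own choices:

```python
import numpy as np

def dual_perceptron(K, y, perm):
    """Algorithm 1: dual perceptron visiting the patterns in the order perm.

    K is the m-by-m kernel matrix, y the labels in {-1, +1}; the sample
    must be linearly separable in feature space (non-empty version space).
    """
    m = len(y)
    alpha = np.zeros(m)
    o = np.zeros(m)              # cached outputs o_i = sum_j alpha_j k(x_j, x_i)
    mistakes = True
    while mistakes:              # terminates for separable data (Novikoff)
        mistakes = False
        for i in perm:
            if y[i] * o[i] <= 0:           # the if branch of Algorithm 1
                alpha[i] += y[i]
                o += y[i] * K[i]           # update all cached outputs at once
                mistakes = True
    return alpha

# Toy data: two well separated Gaussian clusters, linear kernel.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3.0, 1.0, (20, 2)), rng.normal(-3.0, 1.0, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
K = X @ X.T

# Average the normalised solutions over several random permutations.
alphas = []
for _ in range(10):
    a = dual_perceptron(K, y, rng.permutation(len(y)))
    alphas.append(a / np.sqrt(a @ K @ a))      # unit norm in feature space
alpha_bp = np.mean(alphas, axis=0)

assert np.all(y * (K @ alpha_bp) > 0)          # the average stays in version space
```

Because version space is the intersection of half-spaces, i.e. a convex cone, any average of consistent solutions is itself consistent, so the final assertion holds by construction.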

Algorithm 2 Kernel billiard algorithm (in dual variables)
Require: A tolerance TOL ∈ [0,1] and τ_max ∈ ℝ⁺
Require: Existence of a version space V(z), i.e. a linearly separable training sample in feature space
Ensure: for all i = 1,…,m : y_i·Σ_{j=1}^m γ_j·k(x_i,x_j) > 0
  α = 0, β = random, normalise β using equation (21)
  Ξ = 0, ξ_max = 0, p_min = 1
  while ρ₂(p_min, Ξ/(Ξ + ξ_max)) > TOL do
    repeat
      for i = 1,…,m do
        d_i = Σ_{j=1}^m γ_j·k(x_j,x_i), ν_i = Σ_{j=1}^m β_j·k(x_j,x_i)
        τ_i = −d_i/ν_i
      end for
      c′ = argmin_{i : τ_i > 0} τ_i
      if τ_c′ ≥ τ_max then
        β = random, but fulfils equation (22); normalise β using equation (21)
      else
        c = c′
      end if
    until τ_c < τ_max
    γ′ = γ + τ_c·β, normalise γ′ using equation (21)
    β_c ← β_c − 2ν_c/k(x_c,x_c)
    ζ = γ + γ′, normalise ζ using equation (21)
    ξ = √(Σ_{i=1}^m Σ_{j=1}^m (γ_i − γ′_i)(γ_j − γ′_j)·k(x_i,x_j))
    p = Σ_{i=1}^m Σ_{j=1}^m ζ_i·α_j·k(x_i,x_j)
    α = ρ₁(p, Ξ/(Ξ + ξ))·α + ρ₂(p, Ξ/(Ξ + ξ))·ζ
    p_min = min(p, p_min), ξ_max = max(ξ, ξ_max), Ξ = Ξ + ξ, γ = γ′
  end while
  return the expansion coefficients α

References

J. Aitchison. Bayesian tolerance regions (with discussion). Journal of the Royal Statistical Society, Series B, 26:161–175, 1964.

C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the Annual Conference on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.

W. Buntine. A Theory of Learning Classification Rules. PhD thesis, University of Technology, Sydney, Australia, 1992.

C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester, 1995.

R. Cox. Probability, frequency, and reasonable expectations. American Journal of Physics, 14:1–13, 1946.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 2002. Accepted for publication. Also: Technical Report JPL-MLTR-00-1, Jet Propulsion Laboratory, Pasadena, CA, 2000.

T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.

W. Feller. An Introduction To Probability Theory and Its Application, volume 2. John Wiley and Sons, New York, 1966.

T. Graepel and R. Herbrich. The kernel Gibbs sampler. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 514–520, Cambridge, MA, 2001. MIT Press.

T. Graepel, R. Herbrich, and K. Obermayer. Bayesian Transduction. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.

R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2001. In press.

R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. MIT Press.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In Proceedings of the IJCAI Workshop on Support Vector Machines, pages 23–27, 1999a.

R. Herbrich, T. Graepel, and C. Campbell. Bayesian learning in reproducing kernel Hilbert spaces. Technical Report TR 99-11, Technical University of Berlin, 1999b.

R. Herbrich, T. Graepel, and C. Campbell. Robust Bayes point machines. In Proceedings of ESANN 2000, pages 49–54, 2000a.

R. Herbrich, T. Graepel, and J. Shawe-Taylor. Sparsity vs. large margins for linear classifiers. In Proceedings of the Annual Conference on Computational Learning Theory, pages 304–308, 2000b.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142, Berlin, 1998. Springer.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.

N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, 1986.

D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, Computation and Neural Systems, California Institute of Technology, Pasadena, CA, 1991.

D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory, Madison, Wisconsin, 1998. ACM Press.

D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory, pages 164–170, Santa Cruz, USA, 1999.

J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society, London, A 209:415–446, 1909.

C. A. Micchelli. Algebraic aspects of interpolation. Proceedings of Symposia in Applied Mathematics, 36:81–102, 1986.

T. Minka. Expectation Propagation for approximate Bayesian inference. PhD thesis, MIT Media Lab, Cambridge, USA, 2001.

T. M. Mitchell. Version spaces: a candidate elimination approach to rule learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 305–310, Cambridge, Massachusetts, 1977. IJCAI.

T. M. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203–226, 1982.

R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

R. M. Neal. Markov chain Monte Carlo methods based on slicing the density function. Technical Report 9722, Department of Statistics, University of Toronto, 1997.

A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12. Polytechnic Institute of Brooklyn, 1962.

M. Opper and D. Haussler. Generalization performance of Bayes optimal classification algorithms for learning a perceptron. Physical Review Letters, 66:2677, 1991.

M. Opper and W. Kinzel. Statistical Mechanics of Generalisation. Springer, 1995.

M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalize. Journal of Physics A, 23:581–586, 1990.

M. Opper and O. Winther. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12(11):2655–2684, 2000.

E. E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1997.

E. J. G. Pitman. The estimation of the location and scale parameters of a continuous population of any given form. Biometrika, 30:391–421, 1939.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.

J. Platt. Probabilities for SV machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–73, Cambridge, MA, 2000. MIT Press.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

P. Ruján. Playing billiards in version space. Neural Computation, 9:99–122, 1997.

P. Ruján and M. Marchand. Computing the Bayes kernel classifier. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

M. Rychetsky, J. Shawe-Taylor, and M. Glesner. Direct Bayes point machines. In Proceedings of the International Conference on Machine Learning, 2000.

B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, TU Berlin.

B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory, 2001.

B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

A. J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998. GMD Research Series No. 25.

P. Sollich. Probabilistic methods for support vector machines. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.

V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.

T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21:871, 1993.

C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. Jordan, editor, Learning and Inference in Graphical Models. MIT Press, 1999.

C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.


More information

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive

This paper studies a rental firm that offers reusable products to price- and quality-of-service sensitive MANUFACTURING & SERVICE OPERATIONS MANAGEMENT Vol., No. 3, Suer 28, pp. 429 447 issn 523-464 eissn 526-5498 8 3 429 infors doi.287/so.7.8 28 INFORMS INFORMS holds copyright to this article and distributed

More information

Audio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA

Audio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA Audio Engineering Society Convention Paper Presented at the 119th Convention 2005 October 7 10 New York, New York USA This convention paper has been reproduced fro the authors advance anuscript, without

More information

CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY

CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY CLOSED-LOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY Y. T. Chen Departent of Industrial and Systes Engineering Hong Kong Polytechnic University, Hong Kong [email protected]

More information

Modeling operational risk data reported above a time-varying threshold

Modeling operational risk data reported above a time-varying threshold Modeling operational risk data reported above a tie-varying threshold Pavel V. Shevchenko CSIRO Matheatical and Inforation Sciences, Sydney, Locked bag 7, North Ryde, NSW, 670, Australia. e-ail: [email protected]

More information

PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO

PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO Bulletin of the Transilvania University of Braşov Series I: Engineering Sciences Vol. 4 (53) No. - 0 PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO V. CAZACU I. SZÉKELY F. SANDU 3 T. BĂLAN Abstract:

More information

The Virtual Spring Mass System

The Virtual Spring Mass System The Virtual Spring Mass Syste J. S. Freudenberg EECS 6 Ebedded Control Systes Huan Coputer Interaction A force feedbac syste, such as the haptic heel used in the EECS 6 lab, is capable of exhibiting a

More information

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel

Adaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel Recent Advances in Counications Adaptive odulation and Coding for Unanned Aerial Vehicle (UAV) Radio Channel Airhossein Fereidountabar,Gian Carlo Cardarilli, Rocco Fazzolari,Luca Di Nunzio Abstract In

More information

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Lucas Grèze Robert Pellerin Nathalie Perrier Patrice Leclaire February 2011 CIRRELT-2011-11 Bureaux

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index

Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index Analog Integrated Circuits and Signal Processing, vol. 9, no., April 999. Abstract Modified Latin Hypercube Sapling Monte Carlo (MLHSMC) Estiation for Average Quality Index Mansour Keraat and Richard Kielbasa

More information

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers Perforance Evaluation of Machine Learning Techniques using Software Cost Drivers Manas Gaur Departent of Coputer Engineering, Delhi Technological University Delhi, India ABSTRACT There is a treendous rise

More information

Quality evaluation of the model-based forecasts of implied volatility index

Quality evaluation of the model-based forecasts of implied volatility index Quality evaluation of the odel-based forecasts of iplied volatility index Katarzyna Łęczycka 1 Abstract Influence of volatility on financial arket forecasts is very high. It appears as a specific factor

More information

Online Appendix I: A Model of Household Bargaining with Violence. In this appendix I develop a simple model of household bargaining that

Online Appendix I: A Model of Household Bargaining with Violence. In this appendix I develop a simple model of household bargaining that Online Appendix I: A Model of Household Bargaining ith Violence In this appendix I develop a siple odel of household bargaining that incorporates violence and shos under hat assuptions an increase in oen

More information

Factor Model. Arbitrage Pricing Theory. Systematic Versus Non-Systematic Risk. Intuitive Argument

Factor Model. Arbitrage Pricing Theory. Systematic Versus Non-Systematic Risk. Intuitive Argument Ross [1],[]) presents the aritrage pricing theory. The idea is that the structure of asset returns leads naturally to a odel of risk preia, for otherwise there would exist an opportunity for aritrage profit.

More information

Support Vector Machine Soft Margin Classifiers: Error Analysis

Support Vector Machine Soft Margin Classifiers: Error Analysis Journal of Machine Learning Research? (2004)?-?? Subitted 9/03; Published??/04 Support Vector Machine Soft Margin Classifiers: Error Analysis Di-Rong Chen Departent of Applied Matheatics Beijing University

More information

Partitioned Elias-Fano Indexes

Partitioned Elias-Fano Indexes Partitioned Elias-ano Indexes Giuseppe Ottaviano ISTI-CNR, Pisa [email protected] Rossano Venturini Dept. of Coputer Science, University of Pisa [email protected] ABSTRACT The Elias-ano

More information

( C) CLASS 10. TEMPERATURE AND ATOMS

( C) CLASS 10. TEMPERATURE AND ATOMS CLASS 10. EMPERAURE AND AOMS 10.1. INRODUCION Boyle s understanding of the pressure-volue relationship for gases occurred in the late 1600 s. he relationships between volue and teperature, and between

More information

Models and Algorithms for Stochastic Online Scheduling 1

Models and Algorithms for Stochastic Online Scheduling 1 Models and Algoriths for Stochastic Online Scheduling 1 Nicole Megow Technische Universität Berlin, Institut für Matheatik, Strasse des 17. Juni 136, 10623 Berlin, Gerany. eail: [email protected]

More information

Position Auctions and Non-uniform Conversion Rates

Position Auctions and Non-uniform Conversion Rates Position Auctions and Non-unifor Conversion Rates Liad Blurosen Microsoft Research Mountain View, CA 944 [email protected] Jason D. Hartline Shuzhen Nong Electrical Engineering and Microsoft AdCenter

More information

ADJUSTING FOR QUALITY CHANGE

ADJUSTING FOR QUALITY CHANGE ADJUSTING FOR QUALITY CHANGE 7 Introduction 7.1 The easureent of changes in the level of consuer prices is coplicated by the appearance and disappearance of new and old goods and services, as well as changes

More information

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing

Real Time Target Tracking with Binary Sensor Networks and Parallel Computing Real Tie Target Tracking with Binary Sensor Networks and Parallel Coputing Hong Lin, John Rushing, Sara J. Graves, Steve Tanner, and Evans Criswell Abstract A parallel real tie data fusion and target tracking

More information

A magnetic Rotor to convert vacuum-energy into mechanical energy

A magnetic Rotor to convert vacuum-energy into mechanical energy A agnetic Rotor to convert vacuu-energy into echanical energy Claus W. Turtur, University of Applied Sciences Braunschweig-Wolfenbüttel Abstract Wolfenbüttel, Mai 21 2008 In previous work it was deonstrated,

More information

PREDICTION OF MILKLINE FILL AND TRANSITION FROM STRATIFIED TO SLUG FLOW

PREDICTION OF MILKLINE FILL AND TRANSITION FROM STRATIFIED TO SLUG FLOW PREDICTION OF MILKLINE FILL AND TRANSITION FROM STRATIFIED TO SLUG FLOW ABSTRACT: by Douglas J. Reineann, Ph.D. Assistant Professor of Agricultural Engineering and Graee A. Mein, Ph.D. Visiting Professor

More information

ABSTRACT KEYWORDS. Comonotonicity, dependence, correlation, concordance, copula, multivariate. 1. INTRODUCTION

ABSTRACT KEYWORDS. Comonotonicity, dependence, correlation, concordance, copula, multivariate. 1. INTRODUCTION MEASURING COMONOTONICITY IN M-DIMENSIONAL VECTORS BY INGE KOCH AND ANN DE SCHEPPER ABSTRACT In this contribution, a new easure of coonotonicity for -diensional vectors is introduced, with values between

More information

Impact of Processing Costs on Service Chain Placement in Network Functions Virtualization

Impact of Processing Costs on Service Chain Placement in Network Functions Virtualization Ipact of Processing Costs on Service Chain Placeent in Network Functions Virtualization Marco Savi, Massio Tornatore, Giacoo Verticale Dipartiento di Elettronica, Inforazione e Bioingegneria, Politecnico

More information

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks

Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks SECURITY AND COMMUNICATION NETWORKS Published online in Wiley InterScience (www.interscience.wiley.co). Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks G. Kounga 1, C. J.

More information

Preference-based Search and Multi-criteria Optimization

Preference-based Search and Multi-criteria Optimization Fro: AAAI-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. Preference-based Search and Multi-criteria Optiization Ulrich Junker ILOG 1681, route des Dolines F-06560 Valbonne [email protected]

More information

A quantum secret ballot. Abstract

A quantum secret ballot. Abstract A quantu secret ballot Shahar Dolev and Itaar Pitowsky The Edelstein Center, Levi Building, The Hebrerw University, Givat Ra, Jerusale, Israel Boaz Tair arxiv:quant-ph/060087v 8 Mar 006 Departent of Philosophy

More information

Efficient Key Management for Secure Group Communications with Bursty Behavior

Efficient Key Management for Secure Group Communications with Bursty Behavior Efficient Key Manageent for Secure Group Counications with Bursty Behavior Xukai Zou, Byrav Raaurthy Departent of Coputer Science and Engineering University of Nebraska-Lincoln Lincoln, NE68588, USA Eail:

More information

Markov Models and Their Use for Calculations of Important Traffic Parameters of Contact Center

Markov Models and Their Use for Calculations of Important Traffic Parameters of Contact Center Markov Models and Their Use for Calculations of Iportant Traffic Paraeters of Contact Center ERIK CHROMY, JAN DIEZKA, MATEJ KAVACKY Institute of Telecounications Slovak University of Technology Bratislava

More information

Construction Economics & Finance. Module 3 Lecture-1

Construction Economics & Finance. Module 3 Lecture-1 Depreciation:- Construction Econoics & Finance Module 3 Lecture- It represents the reduction in arket value of an asset due to age, wear and tear and obsolescence. The physical deterioration of the asset

More information

Lecture L9 - Linear Impulse and Momentum. Collisions

Lecture L9 - Linear Impulse and Momentum. Collisions J. Peraire, S. Widnall 16.07 Dynaics Fall 009 Version.0 Lecture L9 - Linear Ipulse and Moentu. Collisions In this lecture, we will consider the equations that result fro integrating Newton s second law,

More information

A framework for performance monitoring, load balancing, adaptive timeouts and quality of service in digital libraries

A framework for performance monitoring, load balancing, adaptive timeouts and quality of service in digital libraries Int J Digit Libr (2000) 3: 9 35 INTERNATIONAL JOURNAL ON Digital Libraries Springer-Verlag 2000 A fraework for perforance onitoring, load balancing, adaptive tieouts and quality of service in digital libraries

More information

Reconnect 04 Solving Integer Programs with Branch and Bound (and Branch and Cut)

Reconnect 04 Solving Integer Programs with Branch and Bound (and Branch and Cut) Sandia is a ultiprogra laboratory operated by Sandia Corporation, a Lockheed Martin Copany, Reconnect 04 Solving Integer Progras with Branch and Bound (and Branch and Cut) Cynthia Phillips (Sandia National

More information

How To Get A Loan From A Bank For Free

How To Get A Loan From A Bank For Free Finance 111 Finance We have to work with oney every day. While balancing your checkbook or calculating your onthly expenditures on espresso requires only arithetic, when we start saving, planning for retireent,

More information

An Optimal Task Allocation Model for System Cost Analysis in Heterogeneous Distributed Computing Systems: A Heuristic Approach

An Optimal Task Allocation Model for System Cost Analysis in Heterogeneous Distributed Computing Systems: A Heuristic Approach An Optial Tas Allocation Model for Syste Cost Analysis in Heterogeneous Distributed Coputing Systes: A Heuristic Approach P. K. Yadav Central Building Research Institute, Rooree- 247667, Uttarahand (INDIA)

More information

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization? Partitioning Data on Features or Saples in Counication-Efficient Distributed Optiization? Chenxin Ma Industrial and Systes Engineering Lehigh University, USA [email protected] Martin Taáč Industrial and

More information

Gaussian Processes for Regression: A Quick Introduction

Gaussian Processes for Regression: A Quick Introduction Gaussian Processes for Regression A Quick Introduction M Ebden, August 28 Coents to arkebden@engoacuk MOTIVATION Figure illustrates a typical eaple of a prediction proble given soe noisy observations of

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model

Evaluating Inventory Management Performance: a Preliminary Desk-Simulation Study Based on IOC Model Evaluating Inventory Manageent Perforance: a Preliinary Desk-Siulation Study Based on IOC Model Flora Bernardel, Roberto Panizzolo, and Davide Martinazzo Abstract The focus of this study is on preliinary

More information

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES

REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES Charles Reynolds Christopher Fox reynolds @cs.ju.edu [email protected] Departent of Coputer

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

HOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACK-MERTON-SCHOLES?

HOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACK-MERTON-SCHOLES? HOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACK-MERTON-SCHOLES? WALTER SCHACHERMAYER AND JOSEF TEICHMANN Abstract. We copare the option pricing forulas of Louis Bachelier and Black-Merton-Scholes

More information

The Velocities of Gas Molecules

The Velocities of Gas Molecules he Velocities of Gas Molecules by Flick Colean Departent of Cheistry Wellesley College Wellesley MA 8 Copyright Flick Colean 996 All rights reserved You are welcoe to use this docuent in your own classes

More information

SAMPLING METHODS LEARNING OBJECTIVES

SAMPLING METHODS LEARNING OBJECTIVES 6 SAMPLING METHODS 6 Using Statistics 6-6 2 Nonprobability Sapling and Bias 6-6 Stratified Rando Sapling 6-2 6 4 Cluster Sapling 6-4 6 5 Systeatic Sapling 6-9 6 6 Nonresponse 6-2 6 7 Suary and Review of

More information

Leak detection in open water channels

Leak detection in open water channels Proceedings of the 17th World Congress The International Federation of Autoatic Control Seoul, Korea, July 6-11, 28 Leak detection in open water channels Erik Weyer Georges Bastin Departent of Electrical

More information

Fuzzy Sets in HR Management

Fuzzy Sets in HR Management Acta Polytechnica Hungarica Vol. 8, No. 3, 2011 Fuzzy Sets in HR Manageent Blanka Zeková AXIOM SW, s.r.o., 760 01 Zlín, Czech Republic [email protected] Jana Talašová Faculty of Science, Palacký Univerzity,

More information