Bayes Point Machines


Journal of Machine Learning Research 1 (2001). Submitted 2/01; Published 8/01.

Ralf Herbrich
Microsoft Research, St George House, 1 Guildhall Street, CB2 3NH Cambridge, United Kingdom

Thore Graepel
Technical University of Berlin, Franklinstr. 28/29, 10587 Berlin, Germany

Colin Campbell
Department of Engineering Mathematics, Bristol University, BS8 1TR Bristol, United Kingdom

Editor: Christopher K. I. Williams

Abstract

Kernel classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft boundaries in order to admit training errors. Experimentally, we find that for the zero training error case Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

1. Introduction

Kernel machines have recently gained a lot of attention due to the popularisation of the support vector machine (Vapnik, 1995) with a focus on classification and the revival of Gaussian processes for regression (Williams, 1999). Subsequently, support vector machines have
been modified to handle regression (Smola, 1998) and Gaussian processes have been adapted to the problem of classification (Williams and Barber, 1998; Opper and Winther, 2000). Both schemes essentially work in the same function space that is characterised by kernels and covariance functions, respectively. Whilst the formal similarity of the two methods is striking, the underlying paradigms of inference are very different. The support vector machine was inspired by results from statistical/PAC learning theory while Gaussian processes are usually considered in a Bayesian framework. This ideological clash can be viewed as a continuation in machine learning of the by now classical disagreement between Bayesian and frequentist statistics (Aitchison, 1964). With regard to algorithmics the two schools of thought appear to favour two different methods of learning and predicting: the support vector community, as a consequence of the formulation of the support vector machine as a quadratic programming problem, focuses on learning as optimisation, while the Bayesian community favours sampling schemes based on the Bayesian posterior. Of course there exists a strong relationship between the two ideas, in particular with the Bayesian maximum a posteriori (MAP) estimator being the solution of an optimisation problem. In practice, optimisation based algorithms have the advantage of a unique, deterministic solution and the availability of the cost function as an indicator of the quality of the solution. In contrast, Bayesian algorithms based on sampling and voting are more flexible and enjoy the so-called anytime property, providing a relatively good solution at any point in time. Often, however, they suffer from the computational costs of sampling the Bayesian posterior.

©2001 R. Herbrich, T. Graepel, C. Campbell.

In this paper we present the Bayes point machine as an approximation to Bayesian inference for linear classifiers in kernel space. In contrast to the Gaussian process viewpoint we do not define a Gaussian prior on the length ‖w‖ of the weight vector. Instead, we only consider weight vectors of length ‖w‖ = 1 because it is only the spatial direction of the weight vector that matters for classification. It is then natural to define a uniform prior on the resulting ball-shaped hypothesis space. Hence, we determine the centre of mass of the resulting posterior that is uniform in version space, i.e. in the zero training error region. It should be kept in mind that the centre of mass is merely an approximation to the real Bayes point from which the name of the algorithm was derived. In order to estimate the centre of mass we suggest both a dynamic system called a kernel billiard and an approximative method that uses the perceptron algorithm trained on permutations of the training sample. The latter method proves to be efficient enough to make the Bayes point machine applicable to large data sets. An additional insight into the usefulness of the centre of mass comes from the statistical mechanics approach to neural computing, where the generalisation error for Bayesian learning algorithms has been calculated for the case of randomly constructed and unbiased patterns x (Opper and Haussler, 1991). Thus if ζ is the number of training examples per weight and ζ is large, the generalisation error of the centre of mass scales as 0.44/ζ whereas scaling with ζ is poorer for the solutions found by the linear support vector machine (scales as 0.50/ζ; see Opper and Kinzel, 1995), Adaline (scales as 0.24/√ζ; see Opper et al., 1990) and other approaches. Of course many of the viewpoints and algorithms presented in this paper are based on extensive previous work carried out by numerous authors in
the past. In particular it seems worthwhile to mention that linear classifiers have been studied intensively in two rather distinct communities: the machine learning community and the statistical physics community. While it is beyond the scope of this paper to review the entire history of the field, we would like to emphasise that our geometrical viewpoint as expressed later in the paper has been inspired by the very original paper "Playing billiard in version space" by P. Ruján (Ruján, 1997). Also, in that paper the term Bayes point was coined and the idea of using a billiard-like dynamical system for uniform sampling was introduced. Both we (Herbrich et al., 1999a,b, 2000a) and Ruján and Marchand (2000) independently generalised the algorithm to be applicable in kernel space. Finally, following a theoretical suggestion of Watkin (1993) we were able to scale up the Bayes point algorithm to large data sets by using different perceptron solutions from permutations of the training sample.

The paper is structured as follows: In the following section we review the basic ideas of Bayesian inference with a particular focus on classification learning. Along with a discussion about the optimality of the Bayes classification strategy we show that for the special case of linear classifiers in feature space the centre of mass of all consistent classifiers is arbitrarily close to the Bayes point (with increasing training sample size) and can be efficiently estimated in the linear span of the training data. Moreover, we give a geometrical picture of support vector learning in feature space which reveals that the support vector machine can be viewed as an approximation to the Bayes point machine. In Section 3 we present two algorithms for the estimation of the centre of mass of version space: one exact method and an approximate method tailored for large training samples. An extensive list of experimental results is presented in Section 4, both on small machine learning benchmark datasets as well as on large scale datasets from the field of
handwritten digit recognition. In Section 5 we summarise the results and discuss some theoretical extensions of the method presented. In order to unburden the main text, the lengthy proofs as well as the pseudo-code have been relegated to the appendix.

We denote n-tuples by italic bold letters (e.g. x = (x_1, ..., x_n)), vectors by roman bold letters (e.g. x), random variables by sans serif font (e.g. X) and vector spaces by calligraphic capitalised letters (e.g. X). The symbols P, E and I denote a probability measure, the expectation of a random variable and the indicator function, respectively.
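Since the perceptron trained on permutations of the training sample plays a central role below, a minimal sketch of a perceptron in dual (kernel) variables may help fix ideas. This is an illustration, not the paper's Algorithm 1; the function name and the toy data are assumptions for the example.

```python
import numpy as np

def kernel_perceptron(K, y, max_epochs=100):
    """Perceptron in dual variables: the weight vector is represented as
    w = sum_j alpha_j * y_j * x_j, so only the Gram matrix
    K[i, j] = k(x_i, x_j) is needed.  y holds labels in {-1, +1}."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            # real-valued output of the current classifier at x_i
            if y[i] * (K[i] @ (alpha * y)) <= 0:
                alpha[i] += 1          # mistake-driven update
                mistakes += 1
        if mistakes == 0:              # consistent: w lies in version space
            break
    return alpha * y                   # expansion coefficients of w

# Toy linearly separable sample with a linear kernel:
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
coef = kernel_perceptron(X @ X.T, y)
print(np.sign((X @ X.T) @ coef))       # prints [ 1.  1. -1. -1.]
```

Running the updates in a different permutation of the indices i would, in general, yield a different consistent classifier, which is precisely what the scaling-up method mentioned above exploits.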
2. A Bayesian Consideration of Learning

In this section we would like to revisit the Bayesian approach to learning (see Buntine, 1992; MacKay, 1991; Neal, 1996; Bishop, 1995, for a more detailed treatment). Suppose we are given a training sample z = (x, y) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m of size m drawn iid from an unknown distribution P_Z = P_XY. Furthermore, assume we are given a fixed set H ⊆ Y^X of functions h : X → Y referred to as the hypothesis space. The task of learning is then to find the function h* which performs best on new yet unseen patterns z = (x, y) drawn according to P_XY.

Definition 1 (Learning Algorithm) A (deterministic) learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X is a mapping from training samples z of arbitrary size m ∈ ℕ to functions from X to Y. The image of A, i.e. {A(z) | z ∈ Z^m} ⊆ Y^X, is called the effective hypothesis space H_{A,m} of the learning algorithm A for the training sample size m ∈ ℕ. If there exists a hypothesis space H ⊆ Y^X such that for every training sample size m ∈ ℕ we have H_{A,m} ⊆ H we shall omit the indices on H.

In order to assess the quality of a function h ∈ H we assume the existence of a loss function l : Y × Y → ℝ⁺. The loss l(y, y') ∈ ℝ⁺ is understood to measure the incurred cost when predicting y while the true output was y'. Hence we always assume that for all y ∈ Y, l(y, y) = 0. A typical loss function for classification is the so called zero-one loss l_{0-1} defined as follows.

Definition 2 (Zero-One Loss) Given a fixed output space Y, the zero-one loss is defined by l_{0-1}(y, y') := I_{y ≠ y'}.

Based on the concept of a loss l, let us introduce several quality measures for hypotheses h ∈ H.

Definition 3 (Generalisation and Training Error) Given a probability measure P_XY and a loss l : Y × Y → ℝ⁺ the generalisation error R[h] of a function h : X → Y is defined by

    R[h] := E_XY[l(h(X), Y)].

Given a training sample z = (x, y) ∈ (X × Y)^m of size m and a loss l : Y × Y → ℝ⁺ the training error R_emp[h, z] of a function h : X → Y is given by

    R_emp[h, z] := (1/m) Σ_{i=1}^m l(h(x_i), y_i).

Clearly, only the generalisation error R[h] is appropriate to capture the
performance of a fixed classifier h ∈ H on new patterns z = (x, y). Nonetheless, we shall see that the training error plays a crucial role as it provides an estimate of the generalisation error based on the training sample.

Definition 4 (Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X. Then for any fixed training sample size m ∈ ℕ the generalisation error R_m[A] of A is defined by

    R_m[A] := E_{Z^m}[R[A(Z)]],

that is, the expected generalisation error of the hypotheses found by the algorithm.

Note that for any loss function l : Y × Y → ℝ⁺ a small generalisation error R_m[A] of the algorithm A guarantees a small generalisation error for most randomly drawn training samples z, because by Markov's inequality we have for all ε > 0,

    P_{Z^m}(R[A(Z)] > ε · E_{Z^m}[R[A(Z)]]) ≤ 1/ε.

Hence we can view R_m[A] also as a performance measure of A's hypotheses for randomly drawn training samples z. Finally, let us consider a probability measure P_H over the space of all possible mappings from X to Y. Then, the average generalisation error of a learning algorithm A is defined as follows.
Definition 5 (Average Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm A : ∪_{m=1}^∞ Z^m → Y^X. Then for each fixed training sample size m ∈ ℕ the average generalisation error R̄_m[A] of A is defined by

    R̄_m[A] := E_H[E_{Z^m|H=h}[E_X[E_{Y|X=x,H=h}[l((A(Z))(X), Y)]]]],    (1)

that is, the average performance of the algorithm A's solution learned over the random draw of training samples and target hypotheses.

The average generalisation error is the standard measure of performance of an algorithm A if we have little knowledge about the potential function h* that labels all our data, expressed via P_H. Then, the measure (1) averages out our ignorance about the unknown h*, thus considering the performance of A on average. There is a noticeable relation between R_m[A] and R̄_m[A] if we assume that, given a measure P_H, the conditional distribution of outputs y given x is governed by

    P_{Y|X=x}(y) = P_H(H(x) = y).    (2)

Under this condition we have that R̄_m[A] = R_m[A]. This result, however, is not too surprising taking into account that under the assumption (2) the measure P_H fully encodes the unknown relationship between inputs x and outputs y.

2.1 The Bayesian Solution

In the Bayesian framework we are not simply interested in h* := argmin_{h∈H} R[h] itself but in our knowledge or belief in h*. To this end, Bayesians use the concept of prior and posterior belief, i.e. the knowledge of h* before having seen any data and after having seen the data, which in the current case is our training sample z. It is well known that under consistency rules known as Cox's axioms (Cox, 1946) beliefs can be mapped onto probability measures P_H. Under these rather plausible conditions the only consistent way to transfer prior belief P_H into posterior belief P_{H|Z^m=z} is therefore given by Bayes' theorem:

    P_{H|Z^m=z}(h) = P_{Z^m|H=h}(z) P_H(h) / E_H[P_{Z^m|H=h}(z)] = P_{Y^m|X^m=x,H=h}(y) P_H(h) / E_H[P_{Y^m|X^m=x,H=h}(y)].    (3)

The second expression is obtained by noticing that P_{Z^m|H=h}(z) = P_{Y^m|X^m=x,H=h}(y) P_{X^m|H=h}(x) = P_{Y^m|X^m=x,H=h}(y) P_{X^m}(x), because hypotheses do not have an influence on the generation of patterns. Based on a given loss function l we can further decompose the first term of the numerator of (3), known as the likelihood of h. Let us assume that the probability of a class y given an instance x and an hypothesis h is inversely proportional to the exponential of the loss incurred by h on x.[1] Thus we obtain

    P_{Y|X=x,H=h}(y) = exp(−β l(h(x), y)) / Σ_{y'∈Y} exp(−β l(h(x), y')) = exp(−β l(h(x), y)) / C(x)
                     = { 1/(1 + exp(−β))          if l(h(x), y) := l_{0-1}(h(x), y) = 0
                       { exp(−β)/(1 + exp(−β))    if l(h(x), y) := l_{0-1}(h(x), y) = 1,    (4)

where C(x) is a normalisation constant, which in the case of the zero-one loss l_{0-1} is independent[2] of x, and β controls the assumed level of noise.

[1] In fact, it already suffices to assume that E_{Y|X=x}[l(y, Y)] = E_H[l(y, H(x))], i.e. the prior correctly models the conditional distribution of the classes as far as the fixed loss is concerned.
[2] Note that for loss functions with real-valued arguments this need not be the case, which makes a normalisation independent of x quite intricate (see Sollich, 2000, for a detailed treatment).

Note that the loss used in the exponentiated loss likelihood function
is not to be confused with the decision-theoretic loss used in the Bayesian framework, which is introduced only after a posterior has been obtained in order to reach a risk optimal decision.

Definition 6 (PAC Likelihood) Suppose we are given an arbitrary loss function l : Y × Y → ℝ⁺. Then, we call the function

    P_{Y|X=x,H=h}(y) := I_{y=h(x)}    (5)

of h the PAC likelihood for h.

Note that (5) is the limiting case of (4) for β → ∞. Assuming the PAC likelihood it immediately follows that for any prior belief P_H the posterior belief P_{H|Z^m=z} simplifies to

    P_{H|Z^m=z}(h) = { P_H(h)/P_H(V(z))  if h ∈ V(z)
                     { 0                 if h ∉ V(z),    (6)

where the version space V(z) is defined as follows (see Mitchell, 1977, 1982).

Definition 7 (Version Space) Given an hypothesis space H ⊆ Y^X and a training sample z = (x, y) ∈ (X × Y)^m of size m ∈ ℕ, the version space V(z) ⊆ H is defined by

    V(z) := {h ∈ H | ∀i ∈ {1, ..., m} : h(x_i) = y_i}.

Since all information contained in the training sample z is used to update the prior P_H by equation (3), all that will be used to classify a novel test point x is the posterior belief P_{H|Z^m=z}.

2.2 The Bayes Classification Strategy

In order to classify a new test point x, for each class y the Bayes classification strategy[3] determines the loss incurred by each hypothesis h ∈ H applied to x and weights it according to its posterior probability P_{H|Z^m=z}(h). The final decision is made for the class y ∈ Y that achieves the minimum expected loss, i.e.

    Bayes_z(x) := argmin_{y∈Y} E_{H|Z^m=z}[l(H(x), y)].    (7)

This strategy has the following appealing property.

Theorem 8 (Optimality of the Bayes Classification Strategy) Suppose we are given a fixed hypothesis space H ⊆ Y^X. Then, for any training sample size m ∈ ℕ, for any symmetric loss l : Y × Y → ℝ⁺, and for any two measures P_H and P_X, among all learning algorithms the Bayes classification strategy Bayes_z given by (7) minimises the average generalisation error R̄_m[Bayes_z] under the assumption that for each h with P_H(h) > 0

    ∀y ∈ Y : ∀x ∈ X : E_{Y|X=x,H=h}[l(y, Y)] = l(y, h(x)).    (8)

Proof  Let us consider
a fixed learning algorithm A. Then it holds true that

    R̄_m[A] = E_H[E_{Z^m|H=h}[E_X[E_{Y|X=x,H=h}[l((A(Z))(X), Y)]]]]
            = E_X[E_H[E_{Z^m|H=h}[E_{Y|X=x,H=h}[l((A(Z))(X), Y)]]]]
            = E_X[E_{Z^m}[E_{H|Z^m=z}[E_{Y|X=x,H=h}[l((A(Z))(X), Y)]]]]
            = E_X[E_{Z^m}[E_{H|Z^m=z}[l((A(Z))(X), H(X))]]],    (9)

where we exchanged the order of expectations over X in the second line, applied the theorem of repeated integrals (see, e.g. Feller, 1966) in the third line and finally used (8) in the last line. Using the symmetry of the loss function, the innermost expression of (9) is minimised by the Bayes classification strategy (7)

[3] The reason we do not call this mapping from X to Y a classifier is that the resulting mapping is (in general) not within the hypothesis space considered beforehand.
for any possible training sample z and any possible test point x. Hence, (7) minimises the whole expression, which proves the theorem.

In order to enhance the understanding of this result let us consider the simple case of l = l_{0-1} and Y = {−1, +1}. Then, given a particular classifier h ∈ H having non-zero prior probability P_H(h) > 0, by assumption (8) we require that the conditional distribution of classes y given x is delta peaked at h(x), because

    E_{Y|X=x,H=h}[l_{0-1}(y, Y)] = l_{0-1}(y, h(x))  ⇔  P_{Y|X=x,H=h}(−y) = I_{y≠h(x)}  ⇔  P_{Y|X=x,H=h}(y) = I_{h(x)=y}.

Although for a fixed h ∈ H drawn according to P_H we do not know that Bayes_z achieves the smallest generalisation error R[Bayes_z], we can guarantee that on average over the random draw of h's the Bayes classification strategy is superior. In fact, the optimal classifier for a fixed h ∈ H is simply h itself,[4] and in general Bayes_z(x) ≠ h(x) for at least a few x ∈ X.

2.3 The Bayes Point Algorithm

Although the Bayes classification strategy is on average the optimal strategy to perform when given a limited amount of training data z, it is computationally very demanding as it requires the evaluation of E_{H|Z^m=z}[l(H(x), y)] for each possible y at each new test point x (Graepel et al., 2000). The problem arises because the Bayes classification strategy does not correspond to any one single classifier h ∈ H. One way to tackle this problem is to require the classifier A(z) learned from any training sample z to lie within a fixed hypothesis space H ⊆ Y^X containing functions h ∈ H whose evaluation at a particular test point x can be carried out efficiently. Thus if it is additionally required to limit the possible solutions of a learning algorithm to a given hypothesis space H ⊆ Y^X, we can in general only hope to approximate Bayes_z.

Definition 9 (Bayes Point Algorithm) Suppose we are given a fixed hypothesis space H ⊆ Y^X and a fixed loss l : Y × Y → ℝ⁺. Then, for any two measures P_X and P_H, the Bayes point algorithm A_bp is given by

    A_bp(z) := argmin_{h∈H} E_X[E_{H|Z^m=z}[l
(h(x),h(x))] ], that is, for each training saple z Z the Bayes point algorith chooses the classifier h bp := A bp (z) H that iics best the Bayes classification strategy (7) on average over randoly drawn test points The classifier A bp (z) is called the Bayes point Assuing the correctness of the odel given by (8) we furtherore reark that the Bayes point algorith A bp is the best approxiation to the Bayes classification strategy (7) in ters of the average generalisation error, ie easuring the distance of the learning algorith A for H using the distance A Bayes = R [A] R [Bayes] In this sense, for a fixed training saple z we can view the Bayes point h bp as a projection of Bayes z into the hypothesis space H Y X The difficulty with the Bayes point algorith, however, is the need to know the input distribution P X for the deterination of the hypothesis learned fro z This soehow liits the applicability of the algorith as opposed to the Bayes classification strategy which requires only broad prior knowledge about the underlying relationship expressed via soe prior belief P H 4 It is worthwhile entioning that the only inforation to be used in any classification strategy is the training saple z and the prior P H Hence it is ipossible to detect which classifier h H labels a fixed tuple x only on the basis of the labels y observed on the training saple Thus, although we ight be lucky in guessing h for a fixed h H and z Z we cannot do better than the Bayes classification strategy Bayes z when considering the average perforance the average being taken over the rando choice of the classifiers and the training saples z 25
2.3.1 The Bayes Point for Linear Classifiers

We now turn our attention to the special case of linear classifiers, where we assume that N measurements of the objects x are taken by features φ_i : X → ℝ, thus forming a (vectorial) feature map φ : X → K ⊆ ℓ₂^N with φ(x) = (φ_1(x), ..., φ_N(x)). Note that by this formulation the special case of vectorial objects x is automatically taken care of by the identity map φ(x) = x. For notational convenience we use the shorthand notation[5] x for φ(x) such that ⟨x, w⟩ := Σ_{i=1}^N φ_i(x) w_i. Hence, for a fixed mapping φ the hypothesis space is given by

    H := {x ↦ sign(⟨x, w⟩) | w ∈ W},   W := {w ∈ K | ‖w‖ = 1}.    (10)

As each hypothesis h_w is uniquely defined by its weight vector w, we shall in the following consider prior beliefs P_W over W, i.e. possible weight vectors (of unit length), in place of priors P_H. By construction, the output space is Y = {−1, +1} and we furthermore consider the special case of l = l_{0-1} as defined by Definition 2. If we assume that the input distribution is spherically Gaussian in the feature space K of dimensionality d = dim(K), i.e.

    f_X(x) = π^{−d/2} exp(−‖x‖²),    (11)

then we find that the centre of mass

    w_cm := E_{W|Z^m=z}[W] / ‖E_{W|Z^m=z}[W]‖    (12)

is a very good approximation to the Bayes point w_bp and converges towards w_bp as the posterior belief P_{W|Z^m=z} becomes sharply peaked (for a similar result see Watkin, 1993).

Theorem 10 (Optimality of the Centre of Mass) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ₂^N. Then, for all m ∈ ℕ, if P_X possesses the density (11) and the prior belief is correct, i.e. (8) is valid, the average generalisation error of the centre of mass as given by (12) always fulfils

    R̄_m[A_cm] − R̄_m[A_bp] ≤ E_{Z^m}[κ(ε(Z))],

where

    κ(ε) := { (arccos(ε)/π) · √(1 − ε²)  if ε ≥ 0
            { 2/3                         otherwise

and

    ε(z) := min_{w : P_{W|Z^m=z}(w) > 0} ⟨w_cm, w⟩.

The lengthy proof of this theorem is given in Appendix A.1. The interesting fact to note about this result is that lim_{ε→1} κ(ε) = 0 and thus, whenever the prior belief P_W is not vanishing for some w*, lim_{m→∞} E_{Z^m}[κ(ε(Z))] = 0, because for increasing training sample size the posterior
is sharply peaked at the weight vector labelling the data.[6] This shows that for increasing training sample size the centre of mass (under the posterior P_{W|Z^m=z}) is a good approximation to the optimal projection of the Bayes classification strategy, the Bayes point. Henceforth, any algorithm which aims at returning the centre of mass under the posterior P_{W|Z^m=z} is called a Bayes point machine. Note that in the case of the PAC likelihood as defined in Definition 6 the centre of mass under the posterior P_{W|Z^m=z} coincides with the centre of mass of version space (see Definition 7).

[5] This should not be confused with x, which denotes the sample (x_1, ..., x_m) of training objects.
[6] This result is a slight generalisation of the result in Watkin (1993), which only proved this to be true for the uniform prior P_W.
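As a concrete toy illustration of (12), the centre of mass of version space can be estimated by averaging unit weight vectors drawn uniformly from the sphere and kept only when consistent with the training sample (under the PAC likelihood the posterior is uniform over version space), then renormalising to unit length. This rejection sampler and the data are illustrative assumptions; the paper's own estimators follow in Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable training sample in feature space (labels in {-1, +1}).
X = np.array([[1.0, 0.2], [0.8, 0.5], [-0.9, -0.1], [-0.7, -0.6]])
y = np.array([1, 1, -1, -1])

def centre_of_mass(X, y, n_samples, rng):
    """Estimate w_cm = E[W | Z = z] / ||E[W | Z = z]|| by averaging uniformly
    drawn unit vectors that classify the whole training sample correctly."""
    total = np.zeros(X.shape[1])
    accepted = 0
    while accepted < n_samples:
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)              # uniform direction on the sphere
        if np.all(y * (X @ w) > 0):         # w lies in version space V(z)
            total += w
            accepted += 1
    return total / np.linalg.norm(total)    # normalise back onto W

w_cm = centre_of_mass(X, y, 500, rng)
# The single classifier w_cm is itself consistent with the training sample:
print(np.sign(X @ w_cm) * y)                # prints [1. 1. 1. 1.]
```

Rejection sampling is of course hopeless in high dimensions, which is exactly why the billiard and perceptron-permutation schemes of Section 3 are needed; the sketch only illustrates what those schemes approximate.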
Figure 1: Shown is the margin γ_x(w) = ⟨x, w⟩ under the assumption that ‖w‖ = ‖x‖ = 1. At the same time, γ_x(w) (the length of the dotted line) equals the distance of x from the hyperplane {x̃ | ⟨x̃, w⟩ = 0} (dashed line) as well as the distance of the weight vector w from the hyperplane {w̃ | ⟨x, w̃⟩ = 0} (dashed line). Note, however, that the Euclidean distance of w from the separating boundary {w̃ ∈ W | ⟨x, w̃⟩ = 0} equals b(γ_x(w)), where b is a strictly monotonic function of its argument.

2.4 A (Pseudo) Bayesian Derivation of the Support Vector Machine

In this section we would like to show that the well known support vector machine (Boser et al., 1992; Cortes, 1995; Vapnik, 1995) can also be viewed as an approximation to the centre of mass of version space V(z) in the noise free scenario, i.e. considering the PAC likelihood given in Definition 6, and additionally assuming that ∀x_i ∈ x : ‖x_i‖ = ‖φ(x_i)‖ = const. In order to see this let us recall that the support vector machine aims at maximising the margin γ_z(w) of the weight vector w on the training sample z given by

    γ_z(w) := min_{i∈{1,...,m}} γ_{x_i}(w),   γ_{x_i}(w) := y_i ⟨x_i, w⟩ / ‖w‖,    (13)

which for all w of unit length is merely the minimal real-valued output (flipped to the correct sign) over the whole training sample. In order to solve this problem algorithmically, one takes advantage of the fact that fixing the real-valued output to one (rather than the norm ‖w‖ of the weight vector w) renders the problem of finding the margin maximiser w_SVM as a problem with a quadratic objective function (‖w‖² = ⟨w, w⟩) under linear constraints (y_i ⟨x_i, w⟩ ≥ 1), i.e.

    w_SVM := argmax_{w∈W} min_{i∈{1,...,m}} y_i ⟨x_i, w⟩    (14)
           ∝ argmin_{w ∈ {v | min_{i∈{1,...,m}} y_i ⟨x_i, v⟩ = 1}} ‖w‖².    (15)

Note that the weight vectors in (15) are called the weight vectors of the canonical hyperplanes (see Vapnik, 1998) and that this set is highly dependent on the given training sample. Nonetheless, the solution to (15) is (up to scaling) equivalent to the solution of (14), a formulation much more amenable for
theoretical studies. Interestingly, however, the quantity γ_{x_i}(w) as implicitly defined in (13) is not only the distance of the point y_i x_i from the hyperplane having the normal w but also ‖x_i‖ times the Euclidean distance of the point w from the hyperplane having the normal y_i x_i (see Figure 1). Thus γ_z(w) can be viewed as the radius of
the ball {v ∈ W | ‖w − v‖ ≤ b(γ_z(w))} that only contains weight vectors in version space V(z). Here, b : ℝ⁺ → ℝ⁺ is a strictly monotonic function of its argument and its effect is graphically depicted in Figure 1. As a consequence thereof, maximising the margin γ_z(w) over the choice of w returns the classifier w_SVM that is the centre of the largest ball still inscribable in version space. Note that the whole reasoning relied on the assumption that all training points x_i have a constant norm in feature space K. If this assumption is violated, each distance of a classifier w to the hyperplane having the normal y_i x_i is measured on a different scale, and thus the points with the largest norm ‖x_i‖ in feature space K have the highest influence on the resulting solution. To circumvent this problem it has been suggested elsewhere that input vectors should be normalised in feature space before applying any kernel method, in particular the support vector machine algorithm (see Herbrich and Graepel, 2000; Schölkopf et al., 1999; Joachims, 1998; Haussler, 1999). Furthermore, all indices i ∈ I_SV ⊆ {1, ..., m} at which the minimum y_i ⟨x_i, w_SVM⟩ in (14) is attained are the ones for which y_i ⟨x_i, w⟩ = 1 in the formulation (15). As the latter are called support vectors, we see that the support vectors are the training points at which the largest inscribable ball touches the corresponding hyperplane {w ∈ W | y_i ⟨x_i, w⟩ = 0}.

2.5 Applying the Kernel Trick

When solving (15) over the possible choices of w ∈ W it is well known that the solution w_SVM admits the following representation

    w_SVM = Σ_{i=1}^m α_i x_i,

that is, the solution to (15) must live in the linear span of the training points. This follows naturally from the following theorem (see also Schölkopf et al., 2001).

Theorem 11 (Representer Theorem) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ₂^N, a training sample z = (x, y) ∈ Z^m, a cost function c : X^m × Y^m × ℝ^m → ℝ ∪ {∞} strictly monotonically decreasing in the third argument, and the class of linear functions in K as given by (10). Then any w_z ∈ W defined by

    w_z
:= argmin_{w∈W} c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩))    (16)

admits a representation of the form

    ∃α ∈ ℝ^m : w_z = Σ_{i=1}^m α_i x_i.    (17)

The proof is given in Appendix A.2. In order to see that this theorem applies to support vector machines, note that (14) is equivalent to the minimiser of (16) when using

    c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩)) = −min_{i∈{1,...,m}} y_i ⟨x_i, w⟩,

which is strictly monotonically decreasing in its third argument. A slightly more difficult argument is necessary to see that the centre of mass (12) can also be written as a minimiser of (16) using a specific cost function c. At first we recall that the centre of mass has the property of minimising E_{W|Z^m=z}[‖w − W‖²] over the choice of w ∈ W (see also (18)).

Theorem 12 (Sufficiency of the Linear Span) Suppose we are given a fixed mapping φ : X → K ⊆ ℓ₂^N. Let us assume that P_W is uniform and P_{Y|X=x,W=w}(y) = f(sign(y ⟨x, w⟩)), i.e. the likelihood depends on the sign of the real-valued output y ⟨x, w⟩ of w. Let L_x := {Σ_{i=1}^m α_i x_i | α ∈ ℝ^m} be the linear span of the mapped data points {x_1, ..., x_m} and W_x := W ∩ L_x. Then for any training sample z ∈ Z^m and any w ∈ W

    ∫ ‖w − v‖² dP_{W|Z^m=z}(v) = C ∫ ‖w − v‖² dP_{W|Z^m=z, W∈W_x}(v),    (18)
that is, up to a constant C ∈ ℝ⁺ that is independent of w, it suffices to consider vectors of unit length in the linear span of the mapped training points {x_1, ..., x_m}.

The proof is given in Appendix A.3. An immediate consequence of this theorem is the fact that we only need to consider the m-dimensional sphere W_x in order to find the centre of mass under the assumption of a uniform prior P_W. Hence a cost function c such that (16) finds the centre of mass is given by

    c(x, y, (⟨x_1, w⟩, ..., ⟨x_m, w⟩)) = ∫_{ℝ^m} ‖Σ_{i=1}^m α_i x_i − w‖² dP_{A|Z^m=(x,y)}(α),

where P_{A|Z^m=z} is only non-zero for vectors α such that ‖Σ_{i=1}^m α_i x_i‖ = 1 and is independent of w. The tremendous advantage of a representation of the solution w_z by (17) becomes apparent when considering the real-valued output of a classifier at any given data point (either training or test point)

    ⟨w_z, x⟩ = ⟨Σ_{i=1}^m α_i x_i, x⟩ = Σ_{i=1}^m α_i ⟨x_i, x⟩ = Σ_{i=1}^m α_i k(x_i, x).

Clearly, all that is needed in the feature space K is the inner product function

    k(x, x̃) := ⟨φ(x), φ(x̃)⟩.    (19)

Reversing the chain of arguments indicates how the kernel trick may be used to find an efficient implementation. We fix a symmetric function k : X × X → ℝ called a kernel and show that there exists a feature mapping φ_k : X → K ⊆ ℓ₂^N such that (19) is valid for all x, x̃ ∈ X. A sufficient condition for k being a valid inner product function is given by Mercer's theorem (see Mercer, 1909). In a nutshell, whenever the evaluation of k at any given sample (x_1, ..., x_m) results in a positive semidefinite matrix G_ij := k(x_i, x_j), then k is a so called Mercer kernel. The matrix G is called the Gram matrix and is the only quantity needed in support vector and Bayes point machine learning. For further details on the kernel trick the reader is referred to Schölkopf et al. (1999); Cristianini and Shawe-Taylor (2000); Wahba (1990); Vapnik (1998).

3. Estimating the Bayes Point in Feature Space

In order to estimate the Bayes point in feature space K we consider a Monte Carlo method, i.e. instead of exactly computing the expectation (12) we approximate it by an
average over weight vectors w drawn according to P_{W|Z^m=z} and restricted to W_x (see Theorem 12). In the following we will restrict ourselves to the PAC likelihood given in (5) and P_W being uniform on the unit sphere W ⊂ K. By this assumption we know that the posterior is uniform over version space (see (6)). In Figure 2 we plotted an example for the special case of an N = 3 dimensional feature space K. It is, however, already very difficult to sample uniformly from version space V(z), as this set of points lives on a convex polyhedron on the unit sphere in[7] W_x.

In the following two subsections we present two methods to achieve this sampling. The first method develops on an idea of Ruján (1997) (later followed up by a kernel version of the algorithm in Ruján and Marchand, 2000) that is based on the idea of playing billiards in version space V(z), i.e. after entering the version space with a very simple learning algorithm such as the kernel perceptron (see Algorithm 1), the classifier w is considered as a billiard ball and is bounced for a while within the convex polyhedron V(z). If this billiard is ergodic with respect to the uniform distribution over V(z), i.e. the travel time the billiard ball spends in a subset of V(z) is proportional to the volume of that subset relative to V(z), then averaging over the trajectory of the billiard ball leads in the limit of an infinite number of bounces to the centre of mass of version space. The second method presented tries to overcome the large computational demands of the billiard method by only approximately achieving a uniform sampling of version space. The idea is to use the perceptron

[7] Note that by Theorem 12 it suffices to sample from the projection of the version space onto W_x.
Figure 2: Plot of a version space (convex polyhedron containing the black dot) V(z) in a 3-dimensional feature space K. Each hyperplane is defined by a training example via its normal vector y_i x_i.

learning algorithm in dual variables with different permutations Π : {1, ..., m} → {1, ..., m} so as to obtain different consistent classifiers w_i ∈ V(z) (see Watkin, 1993, for a similar idea). Obviously, the number of different samples obtained is finite and thus it is impossible to achieve exactness of the method in the limit of considering all permutations. Nevertheless, we shall demonstrate that in particular for the task of handwritten digit recognition the achieved performances are comparable to state-of-the-art learning algorithms.

Finally, we would like to remark that recently other efficient methods to estimate the Bayes point directly have been presented (Rychetsky et al., 2000; Minka, 2001). The main idea in Rychetsky et al. (2000) is to work out all corners w_i of version space and average over them in order to approximate the centre of mass of version space. Note that there are exactly m corners, because the i-th corner w_i satisfies ⟨x_j, w_i⟩ = 0 for all j ≠ i and y_i ⟨x_i, w_i⟩ > 0. If X = (x_1, ..., x_m)' is the m × N matrix of mapped training points flipped to their correct side and we use the approach (7) for w, this simplifies to

X w_i = X X' α_i = G α_i = (0, ..., 0, y_i, 0, ..., 0)' =: y_i e_i,

where the right hand side is the i-th unit vector multiplied by y_i. As a consequence, the expansion coefficients α_i of the i-th corner w_i can easily be computed as α_i = y_i G^{-1} e_i and then need to be normalised such that ‖w_i‖ = 1. The difficulty with this approach, however, is the fact that the inversion of the Gram matrix G is O(m^3) and is thus as computationally complex as support vector learning, while not enjoying the anytime property of a sampling scheme. The algorithm presented in Minka (2001, Chapter 5) (also see Opper and Winther, 2000, for an equivalent method) uses the idea of approximating the posterior measure P_{W|Z=z} by a product of Gaussian densities
so that the centre of mass can be computed analytically. Although the approximation of the cut-off posterior P_{W|Z=z}, resulting from the delta-peaked likelihood given in Definition 6, by Gaussian measures seems very crude at first glance, Minka could show that his method compares favourably to the results presented in this paper.

3.1 Playing Billiards in Version Space

In this subsection we present the billiard method to estimate the Bayes point, i.e. the centre of mass of version space when assuming a PAC likelihood and a uniform prior P_W over weight vectors of unit length (the pseudo
code is given on page 275).

Figure 3: Schematic view of the kernel billiard algorithm. Starting at b_0 ∈ V(z), a trajectory of billiard bounces b_1, ..., b_5, ... is calculated and then averaged over so as to obtain an estimate ŵ_cm of the centre of mass of version space.

By Theorem 2, each position b of the billiard ball and each estimate w_i of the centre of mass of V(z) can be expressed as linear combinations of the mapped input points, i.e.

w = Σ_{i=1}^m α_i x_i, b = Σ_{i=1}^m γ_i x_i, α, γ ∈ R^m.

Without loss of generality we can make the following ansatz for the direction vector v of the billiard ball,

v = Σ_{i=1}^m β_i x_i, β ∈ R^m.

Using this notation, inner products and norms in feature space K become

⟨b, v⟩ = Σ_{i=1}^m Σ_{j=1}^m γ_i β_j k(x_i, x_j), ‖b‖^2 = Σ_{i,j=1}^m γ_i γ_j k(x_i, x_j),

where k : X × X → R is a Mercer kernel and has to be chosen beforehand. At the beginning we assume that w_0 = α = 0. Before generating a billiard trajectory in version space V(z) we first run any learning algorithm to find an initial starting point b_0 inside the version space (e.g. support vector learning or the kernel perceptron (see Algorithm 1)). Then the kernel billiard algorithm consists of three steps (see also Figure 3):

1. Determine the closest boundary in direction v_i starting from the current position b_i. Since it is computationally very demanding to calculate the flight time of the billiard ball on geodesics of the hypersphere W_x (see also Neal, 1997), we make use of the fact that the shortest distance in Euclidean space (if it exists) is also the shortest distance on the hypersphere W_x. Thus, we have for the flight time τ_j of the billiard ball at position b_i in direction v_i to the hyperplane with normal vector y_j x_j:

τ_j = −⟨b_i, x_j⟩ / ⟨v_i, x_j⟩. (21)

After calculating all m flight times, we look for the smallest positive one, i.e.

c = argmin_{j : τ_j > 0} τ_j.
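In dual variables, step 1 needs only the Gram matrix: with b = Σ_i γ_i x_i and v = Σ_i β_i x_i, the inner products ⟨b, x_j⟩ and ⟨v, x_j⟩ are the components of Gγ and Gβ. A sketch of the flight-time computation (the helper name is ours, not the paper's):

```python
import numpy as np

def closest_boundary(gamma, beta, G):
    """Flight times tau_j = -<b, x_j> / <v, x_j> to every wall with normal
    y_j x_j, computed in dual variables: <b, x_j> = (G @ gamma)_j and
    <v, x_j> = (G @ beta)_j. Returns the index of the first wall hit
    (smallest strictly positive tau) and its flight time; assumes at least
    one tau_j is positive."""
    with np.errstate(divide="ignore", invalid="ignore"):
        tau = -(G @ gamma) / (G @ beta)
    hit = np.where(tau > 0)[0]
    c = hit[np.argmin(tau[hit])]
    return c, tau[c]
```

With the linear kernel and orthonormal training points (G = I), a ball at b = (0.5, 0.5) flying in direction v = (0, -1) hits the second wall after time 0.5.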
Determining the closest bounding hyperplane in Euclidean space rather than on geodesics causes problems if the surface of the hypersphere W_x is almost orthogonal to the direction vector v_i, in which case τ_c → ∞. If this happens we randomly generate a direction vector v_i pointing towards the version space V(z). Assuming that the last bounce took place at the hyperplane having normal y_c x_c, this condition can easily be checked by

y_c ⟨v_i, x_c⟩ > 0. (22)

Note that since the samples are taken from the bouncing points, the above procedure of dealing with the curvature of the hypersphere does not constitute an approximation but is exact. An alternative method of dealing with the problem of the curvature of the hypersphere W can be found in Minka (2001, Section 5.8).

2. Update the billiard ball's position to b_{i+1} and the new direction vector to v_{i+1}. The new point b_{i+1} and the new direction v_{i+1} are calculated from

b_{i+1} = b_i + τ_c v_i, (23)

v_{i+1} = v_i − 2 (⟨v_i, x_c⟩ / ‖x_c‖^2) x_c. (24)

Afterwards the position b_{i+1} and the direction vector v_{i+1} need to be normalised, which is easily achieved using the Gram matrix expressions for norms given above.

3. Update the centre of mass w_i of the whole trajectory by the new line segment from b_i to b_{i+1} calculated on the hypersphere W_x. Since the solution lies on the hypersphere W_x (see Theorem 1), we cannot simply update the centre of mass using a weighted vector addition. Let us introduce the operation ⊕_μ acting on vectors of unit length. This function has to have the following properties:

‖s ⊕_μ t‖^2 = 1,
‖t − s ⊕_μ t‖ = μ ‖t − s‖,
s ⊕_μ t = ρ_1(s, t, μ) s + ρ_2(s, t, μ) t, ρ_1(s, t, μ) ≥ 0, ρ_2(s, t, μ) ≥ 0.

This rather arcane definition implements a weighted addition of s and t such that μ is the fraction between the resulting chord length ‖t − s ⊕_μ t‖ and the total chord length ‖t − s‖. In Appendix A.4 it is shown that the following formulae for ρ_1(s, t, μ) and ρ_2(s, t, μ) implement such a weighted addition:

ρ_1(s, t, μ) = μ √( (2 − μ^2 (1 − ⟨s, t⟩)) / (1 + ⟨s, t⟩) ),
ρ_2(s, t, μ) = −ρ_1(s, t, μ) ⟨s, t⟩ + (1 − μ^2 (1 − ⟨s, t⟩)).

By assuming a constant line density on the manifold
V(z), the whole line between b_i and b_{i+1} can be represented by its midpoint m on the manifold V(z), given by

m = (b_i + b_{i+1}) / ‖b_i + b_{i+1}‖.

Thus, one updates the centre of mass of the trajectory by

w_{i+1} = ρ_1(w_i, m, Ξ_i / (Ξ_i + ξ_i)) w_i + ρ_2(w_i, m, Ξ_i / (Ξ_i + ξ_i)) m,
where ξ_i = ‖b_i − b_{i+1}‖ is the length of the trajectory in the i-th step and Ξ_i = Σ_{j=1}^i ξ_j is the accumulated length up to the i-th step. Note that the operation ⊕_μ is only an approximation to the addition operation we sought, because an exact weighting would require the arc lengths rather than the chord lengths. As a stopping criterion we suggest computing an upper bound on ρ_2, the weighting factor of the new part of the trajectory. If this value falls below a prespecified threshold (TOL) we stop the algorithm. Note that the increase in Ξ_i will always lead to termination.

3.2 Large Scale Bayes Point Machines

Clearly, all we need for estimating the centre of mass of version space is a set of unit length weight vectors w_i drawn uniformly from V(z). In order to save computational resources it might be advantageous to achieve a uniform sample only approximately. The classical perceptron learning algorithm offers the possibility to obtain up to m! different classifiers in version space simply by learning on different permutations of the training sample. Of course, due to the sparsity of the solution the number of different classifiers obtained is usually considerably less. A classical theorem to be found in Novikoff (1962) guarantees the convergence of this procedure and furthermore provides an upper bound on the number t of mistakes needed until convergence. More precisely, if there exists a classifier w_SVM with margin γ_z(w_SVM) > 0 (see (3)), then the number of mistakes until convergence, which is an upper bound on the sparsity of the solution, is not more than ς^2 / γ_z^2(w_SVM), where ς is the smallest real number such that ‖x_i‖_K ≤ ς. The quantity γ_z(w_SVM) is maximised for the solution w_SVM found by the support vector machine, and whenever the support vector machine is theoretically justified by results from learning theory (see Shawe-Taylor et al., 1998; Vapnik, 1998), the ratio ς^2 / γ_z^2(w_SVM) is considerably less than m, say d. Algorithmically, we can benefit from this sparsity by the
following trick: since

w = Σ_{i=1}^m α_i x_i,

all we need to store is the m-dimensional vector α. Furthermore, we keep track of the m-dimensional vector o of real-valued outputs

o_i = ⟨x_i, w_t⟩ = Σ_{j=1}^m α_j k(x_i, x_j)

of the current solution at the i-th training point. By definition, in the beginning α = o = 0. Now, if o_i y_i ≤ 0 we update α_i by α_i + y_i and update o by o_j ← o_j + y_i k(x_i, x_j), which requires only m kernel calculations (the evaluation of the i-th row of the Gram matrix G). In summary, the memory requirement of this algorithm is 2m and the number of kernel calculations is not more than d·m. As a consequence, the computational requirement of this algorithm is no more than the computational requirement for the evaluation of the margin γ_z(w_SVM)!

We suggest using this efficient perceptron learning algorithm in order to obtain samples w_i for the computation of the centre of mass. In order to investigate the usefulness of this approach experimentally, we compared the distribution of generalisation errors of samples obtained by perceptron learning on permuted training samples with samples obtained by a full Gibbs sampling (see Graepel and Herbrich, 2000, for details on the kernel Gibbs sampler). For computational reasons, we used only 188 training patterns and 453 test patterns of the classes 1 and 2 from the MNIST data set.^8 In Figure 4 (a) and (b) we plotted the distribution of generalisation errors over random samples using the kernel^9

k(x, x̃) = (⟨x, x̃⟩ + 1)^5. (25)

Using a quantile-quantile (QQ) plot technique we can compare both distributions in one graph (see Figure 4 (c)). These plots suggest that by simple permutation of the training sample we are able to obtain a sample of classifiers exhibiting a similar distribution of generalisation error to the one obtained by time-consuming Gibbs sampling.

8. This data set is publicly available at
9. We decided to use this kernel because it showed excellent generalisation performance when using the support vector machine.
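The output-caching trick described above can be sketched as follows; the function name and the stopping policy (a fixed maximum number of epochs) are our own choices:

```python
import numpy as np

def kernel_perceptron(G, y, rng, max_epochs=100):
    """Dual perceptron with a cached output vector o (o_i = sum_j alpha_j G_ij).
    Each mistake updates one coefficient alpha_i and refreshes o with a single
    row of G, so the cost is one Gram-matrix row per mistake. Visiting the
    points in a fresh random permutation yields a (generally) different
    consistent classifier each call."""
    m = len(y)
    alpha = np.zeros(m)
    o = np.zeros(m)                      # cached real-valued outputs
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(m):
            if o[i] * y[i] <= 0:         # point misclassified (or on boundary)
                alpha[i] += y[i]
                o += y[i] * G[i]         # o_j <- o_j + y_i k(x_i, x_j)
                mistakes += 1
        if mistakes == 0:
            return alpha                 # consistent on the training sample
    return alpha
```

On a separable sample the loop terminates within Novikoff's mistake bound; the returned α is one sample w ∈ V(z) for the centre-of-mass average.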
Figure 4: (a) Histogram of generalisation errors (estimated on a test set) using a kernel Gibbs sampler. (b) Histogram of generalisation errors (estimated on a test set) using a kernel perceptron. (c) QQ plot of distributions (a) and (b). The straight line indicates that the two distributions only differ by an additive and a multiplicative constant, i.e. they exhibit the same rate of decay.

A very advantageous feature of this approach as compared to support vector machines is its adjustable time and memory requirement and the anytime availability of a solution due to sampling. If the training sample grows further and we are not able to spend more time learning, we can adjust the number of samples w used, at the cost of a slightly worse generalisation error (see also Section 4).

3.3 Extension to Training Error

To allow for training errors we recall that the version space conditions are given by

∀ (x_i, y_i) ∈ z : y_i ⟨x_i, w⟩ = y_i Σ_{j=1}^m α_j k(x_i, x_j) > 0. (26)

Now we introduce the following version space conditions in place of (26):

∀ (x_i, y_i) ∈ z : y_i Σ_{j=1}^m α_j k(x_i, x_j) > −λ y_i α_i k(x_i, x_i), (27)

where λ ≥ 0 is an adjustable parameter related to the softness of version space boundaries. Clearly, considering this from the billiard viewpoint, equation (27) can be interpreted as allowing penetration of the walls, an idea already hinted at in Ruján (1997). Since the linear decision function is invariant under any positive rescaling of the expansion coefficients α, the factor α_i on the right hand side makes λ scale invariant as well. Although other ways of incorporating training errors are conceivable, our formulation allows for a simple modification of the algorithms described in the previous two subsections. To see this, note that equation (27) can be rewritten as

∀ (x_i, y_i) ∈ z : y_i ( Σ_{j=1}^m α_j (1 + λ I_{i=j}) k(x_i, x_j) ) > 0.

Hence we can use the above algorithms but with an additive correction to the
diagonal terms of the Gram matrix. This additive correction to the kernel diagonals is similar to the quadratic margin loss used to introduce a soft margin during training of support vector machines (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000). Another insight into the introduction of soft boundaries comes from noting that the distance between two points x_i and x_j in feature space K can be written

‖x_i − x_j‖^2 = ‖x_i‖^2 + ‖x_j‖^2 − 2 ⟨x_i, x_j⟩,
which in the case of points of unit length in feature space becomes 2(1 + λ − k(x_i, x_j)). Thus, if we add λ to the diagonal elements of the Gram matrix, the points become equidistant for λ → ∞. This would give the resulting version space a more regular shape. As a consequence, the centre of the largest inscribable ball (the support vector machine solution) would tend towards the centre of mass of the whole of version space.

Figure 5: Parameter spaces for a two-dimensional toy problem obtained by introducing training error via an additive correction to the diagonal term of the kernel matrix (panels: λ = 0, 0.5, 1, 1.5, 2, 2.5). In order to visualise the resulting parameter space we fixed m = 3 and normalised all axes by the product of eigenvalues λ_1 λ_2 λ_3. See text for further explanation.

We would like to recall that the effective parameter space of weight vectors considered is given by

W_x := { w = Σ_{i=1}^m α_i x_i : ‖w‖^2 = Σ_{i=1}^m Σ_{j=1}^m α_i α_j ⟨x_i, x_j⟩ = 1 }.

In terms of α this can be rewritten as

{ α ∈ R^m : α' G α = 1 }, G_ij = ⟨x_i, x_j⟩ = k(x_i, x_j).

Let us represent the Gram matrix by its spectral decomposition, i.e. G = U Λ U', where U'U = I and Λ = diag(λ_1, ..., λ_m) is the diagonal matrix of eigenvalues λ_i. Thus we know that the parameter space is the set of all coefficients α̃ = U'α which fulfil

{ α̃ ∈ R^m : α̃' Λ α̃ = 1 }.

This is the defining equation of an m-dimensional axis-parallel ellipsoid. Now adding the term λ to the diagonal of G makes G a full rank matrix (see Micchelli, 1986). In Figure 5 we plotted the parameter space for a 2D toy problem using only m = 3 training points. Although the parameter space is 3-dimensional for all λ > 0, we obtain a pancake-like parameter space for small values of λ. For λ → ∞ the set of admissible coefficients α becomes the m-dimensional ball, i.e. the training examples become more and more orthogonal with increasing λ. The way we incorporated training errors corresponds to the choice of a new kernel given by

k_λ(x, x̃) := k(x, x̃) + λ I_{x = x̃}.
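The soft-boundary kernel k_λ amounts to a single diagonal correction of the Gram matrix; a minimal sketch (the helper name is ours):

```python
import numpy as np

def soft_gram(G, lam):
    """Gram matrix of k_lambda(x, x~) = k(x, x~) + lam * I_{x = x~}:
    only the diagonal (training-point self-similarities) changes, so
    test points outside the training sample are unaffected."""
    return G + lam * np.eye(G.shape[0])

# Every eigenvalue is shifted up by lam, so even a rank-deficient Gram
# matrix becomes full rank for lam > 0 (cf. the spectral argument above).
G = np.array([[1.0, 1.0],
              [1.0, 1.0]])              # rank 1
print(np.linalg.matrix_rank(soft_gram(G, 0.5)))  # 2
```

This is the same device as the quadratic soft-margin correction for support vector machines mentioned in the text.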
Figure 6: Version spaces V(z) for two 3-dimensional toy problems. (Left) One can see that the approximation of the Bayes point (diamond) by the centre of the largest inscribable ball (cross) is reasonable if the version space is regularly shaped. (Right) The situation changes in the case of an elongated and asymmetric version space V(z).

Finally, note that this modification of the kernel has no effect on new test points x that are not elements of the training sample x. For an explanation of the effect of λ in the context of Gaussian processes see Opper and Winther (2000).

4. Experimental Results

In this section we present experimental results both on University of California, Irvine (UCI) benchmark datasets and on two bigger tasks of handwritten digit recognition, namely the US Postal Service (USPS) and the Modified National Institute of Standards (MNIST) digit recognition tasks. We compared our results to the performance of a support vector machine using reported test set performance from Rätsch et al. (2001) (UCI), Schölkopf (1997, p. 57) (USPS) and Cortes (1995) (MNIST). All the experiments were done using Algorithm 2 in Appendix B.

4.1 Artificial Data

For illustration purposes we set up a toy dataset of training and test points in R^3. The data points were uniformly generated in [−1, 1]^3 and labelled by a randomly generated linear decision rule using the kernel k(x, x̃) = ⟨x, x̃⟩. In Figure 6 we illustrate the potential benefits of a Bayes point machine over a support vector machine for elongated version spaces. By using the billiard algorithm to estimate the Bayes point (see Subsection 3.1), we were able to track all positions b_i where the billiard ball hits a version space boundary. This allows us to easily visualise the version spaces V(z). For the example illustrated in Figure 6 (right), the support vector machine and Bayes point solutions with hard margins/boundaries are far apart, resulting in a noticeable reduction in generalisation error of the Bayes point machine (8%) compared to the support vector machine (5%) solution, whereas for regularly shaped version spaces (Figure 6 (left)) the difference is negligible (6% to 6%).

Publicly available at
Figure 7: Decision functions for a 2D toy problem of a support vector machine (SVM) (left) and a Bayes point machine (BPM) (right) using hard margins (λ = 0) and RBF kernels with σ = 1. Note that the Bayes point machine results in a much flatter function, sacrificing margin (γ_z(w_SVM) = 0.36 vs. γ_z(w_cm) = 0.2) for smoothness.

In a second illustrative example we compared the smoothness of the resulting decision function when using kernels both with support vector machines and Bayes point machines. In order to model a nonlinear decision surface we used the radial basis function (RBF) kernel

k(x, x̃) = exp( −‖x − x̃‖^2 / (2σ^2) ). (28)

Figure 7 shows the resulting decision functions in the hard margin/boundary case. Clearly, the Bayes point machine solution appears much smoother than the support vector machine solution, although its geometrical margin of 0.2 is significantly smaller. The above examples should only be considered as aids to enhance the understanding of the Bayes point machine algorithm's properties rather than strict arguments about general superiority.

4.2 UCI Benchmark Datasets

To investigate the performance on real world datasets we compared hard margin support vector machines to Bayes point machines with hard boundaries (λ = 0) when using the kernel billiard algorithm described in Subsection 3.1. We studied the performance on 5 standard benchmarking datasets from the UCI Repository, and banana and waveform, two toy datasets (see Rätsch et al., 2001). In each case the data was randomly partitioned into training and test sets in the ratio 60%:40%. The means and standard deviations of the average generalisation errors on the test sets are presented as percentages in the columns headed SVM (hard margin) and BPM (λ = 0) in Table 1. As can be seen from the results, the Bayes point machine outperforms support vector machines on almost all datasets at a statistically significant level. Note, however, that the result of the t-test is strictly valid only under the assumption that training and test data
were independent, an assumption which may be violated by the procedure of splitting the one data set into different pairs of training and test sets (Dietterich, 1998). Thus, the resulting p-values should serve only as an indication of the significance of the result.

In order to demonstrate the effect of positive λ (soft boundaries), we trained a Bayes point machine with soft boundaries and compared it to training a support vector machine with soft margin using the same Gram
            SVM (hard margin)   BPM (hard boundary)   p-value
Heart       254±4               228±34
Thyroid     53±24               44±2                  3
Diabetes    33±24               32±25                 5
Waveform    3±                  2±9                   2
Banana      62±5                5±4                   5
Sonar       54±37               59±38
Ionosphere  9±25                5±

Table 1: Experimental results on seven benchmark datasets. We used the RBF kernel given in (28) with values of σ found optimal for SVMs. Shown is the estimated generalisation error in percent. The standard deviation was obtained on the different runs. The final column gives the p-values of a paired t-test for the hypothesis "BPM is better than SVM", indicating that the improvement is statistically significant.

matrix (see equation (27)). It can be shown that such a support vector machine corresponds to a soft margin support vector machine where the margin slacks are penalised quadratically (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000; Herbrich, 2000). In Figure 8 we have plotted the generalisation error as a function of λ for the toy problem from Figure 6 and the dataset heart, using the same setup as in the previous experiment. We observe that the support vector machine with an l2 soft margin achieves a minimum of the generalisation error which is close to, or just above, the minimum error which can be achieved using a Bayes point machine with positive λ. This may not be too surprising taking the change of geometry into account (see Section 3.3). Thus, the soft margin support vector machine also approximates the Bayes point machine with soft boundaries.

Finally, we would like to remark that the running time of the kernel billiard was not much different from the running time of our support vector machine implementation. We did not use any chunking or decomposition algorithms (see, e.g., Osuna et al., 1997; Joachims, 1999; Platt, 1999), which in the case of support vector machines would have decreased the running time by orders of magnitude. The most noticeable difference in running time was with the waveform and banana datasets, where we are given m = 400 observations. This can be explained by the fact that the computational effort of the kernel billiard
method is O(B m^2), where B is the number of bounces. As we set our tolerance criterion TOL for stopping very low (10^-4), the number B of bounces required for these datasets was very large. Hence, in contrast to the computational effort of O(m^3) for support vector machines, the number B of bounces led to a much higher computational demand when using the kernel billiard.

4.3 Handwritten Digit Recognition

For the two tasks we now consider, our inputs are n × n grey value images which were transformed into n^2-dimensional vectors by concatenation of the rows. The grey values were taken from the set {0, ..., 255}. All images were labelled by one of the ten classes 0 to 9. For each of the ten classes y ∈ {0, ..., 9} we ran the perceptron algorithm L times, each time labelling all training points of class y by +1 and the remaining training points by −1. On a Pentium III 500 MHz with 128 MB memory each learning trial took 2 minutes (MNIST) or 2 minutes (USPS), respectively.^1 For the classification of a test image x

1. Note, however, that we made use of the fact that 40% of the grey values of each image are 0, since they encode background. Therefore, we encoded each image as an index-value list which allows much faster computation of the inner products ⟨x, x̃⟩ and speeds up the algorithm by a constant factor.
Figure 8: Comparison of soft boundary Bayes point machines with soft margin support vector machines. Plotted is the generalisation error versus λ for a toy problem using linear kernels (left) and the heart dataset using RBF kernels with σ = 3 (right). The error bars indicate one standard deviation of the estimated mean.

we calculated the real-valued output of all the different classifiers^2 by

f_i(x) = ⟨x, w_i⟩ / (‖w_i‖ ‖x‖) = Σ_{j=1}^m (α_i)_j k(x_j, x) / √( Σ_{r=1}^m Σ_{s=1}^m (α_i)_r (α_i)_s k(x_r, x_s) · k(x, x) ),

where we used the kernel k given by (25). Here, (α_i)_j refers to the expansion coefficient corresponding to the i-th classifier and the j-th data point. Now, for each of the ten classes we calculated the real-valued decision of the Bayes point estimate ŵ_cm,y by^3

f_bp,y(x) = ⟨x, ŵ_cm,y⟩ = (1/L) Σ_{i=1}^L ⟨x, w_{i+yL}⟩.

In a Bayesian spirit, the final decision was carried out by

h_bp(x) := argmax_{y ∈ {0,...,9}} f_bp,y(x).

Note that f_bp,y(x) can be interpreted as an (unnormalised) approximation of the posterior probability that x is of class y when restricted to the function class (1) (see Platt, 2000). In order to test the dependence of the generalisation error on the magnitude max_y f_bp,y(x), we fixed a certain rejection rate r ∈ [0, 1] and rejected the fraction r of test points with the smallest value of max_y f_bp,y(x).

MNIST Handwritten Digits. In the first of our large scale experiments we used the full MNIST dataset with 60,000 training examples and 10,000 test examples of 28 × 28 grey value images of handwritten digits. The plot resulting from learning only consistent classifiers per class and rejection based on the real-valued output of the single Bayes points is depicted in Figure 9 (left). As can be seen from this plot, even without rejection the Bayes point has excellent generalisation performance when compared to support vector machines, which achieve a generalisation error of 1.4%.^4 Furthermore, rejection based on the real-valued

2. For notational simplicity we assume that the
first L classifiers are classifiers for class 0, the next L for class 1 and so on.
3. Note that in this subsection y ranges over {0, ..., 9}.
4. The result of 1.1% with the kernel (25) and a polynomial degree of four could not be reproduced and is thus considered invalid (personal communication with P. Haffner). Note also that the best results with support vector machines were obtained when using a soft margin.
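The normalised outputs and the per-class averaging used for the digit experiments can be sketched as follows; the helper names and the toy numbers in the example are ours, not the paper's:

```python
import numpy as np

def normalised_output(alpha, k_train_x, G, k_xx):
    """f(x) = <x, w> / (||w|| ||x||) from the dual expansion of w:
    numerator is sum_j alpha_j k(x_j, x); the norm of w is alpha' G alpha."""
    return (alpha @ k_train_x) / np.sqrt((alpha @ G @ alpha) * k_xx)

def bayes_decision(alphas_per_class, k_train_x, G, k_xx):
    """Average the normalised outputs of the L classifiers of each class
    (the approximate f_bp_y) and predict the argmax; the maximal score can
    also serve as the rejection criterion described in the text."""
    f_bp = np.array([
        np.mean([normalised_output(a, k_train_x, G, k_xx) for a in A])
        for A in alphas_per_class])
    return int(np.argmax(f_bp)), float(f_bp.max())

# Tiny example: two classes, one classifier each, orthonormal training
# points (G = I); the test point matches the first training point.
G = np.eye(2)
k_train_x = np.array([1.0, 0.0])
cls, score = bayes_decision([[np.array([1.0, 0.0])],
                             [np.array([0.0, 1.0])]], k_train_x, G, 1.0)
print(cls, score)  # 0 1.0
```

Fixing a rejection rate r then amounts to discarding the fraction r of test points with the smallest `score`.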