Bayes Point Machines


Journal of Machine Learning Research 1 (2001). Submitted 2/01; Published 8/01

Ralf Herbrich, Microsoft Research, St George House, 1 Guildhall Street, CB2 3NH Cambridge, United Kingdom
Thore Graepel, Technical University of Berlin, Franklinstr. 28/29, 10587 Berlin, Germany
Colin Campbell, Department of Engineering Mathematics, Bristol University, BS8 1TR Bristol, United Kingdom

Editor: Christopher K. I. Williams

© 2001 R. Herbrich, T. Graepel, C. Campbell

Abstract

Kernel classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well-known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft boundaries in order to admit training errors. Experimentally, we find that for the zero training error case Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.

1 Introduction

Kernel machines have recently gained a lot of attention due to the popularisation of the support vector machine (Vapnik, 1995) with a focus on classification and the revival of Gaussian processes for regression (Williams, 1999). Subsequently, support vector machines have been modified to handle regression (Smola, 1998) and Gaussian processes have been adapted to the problem of classification (Williams and Barber, 1998; Opper and Winther, 2000). Both schemes essentially work in the same function space that is characterised by kernels and covariance functions, respectively. Whilst the formal similarity of the two methods is striking, the underlying paradigms of inference are very different. The support vector machine was inspired by results from statistical/PAC learning theory while Gaussian processes are usually considered in a Bayesian framework. This ideological clash can be viewed as a continuation in machine learning of the by now classical disagreement between Bayesian and frequentist statistics (Aitchison, 1964).

With regard to algorithmics the two schools of thought appear to favour two different methods of learning and predicting: the support vector community, as a consequence of the formulation of the support vector machine as a quadratic programming problem, focuses on learning as optimisation while the Bayesian community favours sampling schemes based on the Bayesian posterior. Of course there exists a strong relationship between the two ideas, in particular with the Bayesian maximum a posteriori (MAP) estimator being the solution of an optimisation problem. In practice, optimisation-based algorithms have the advantage of a unique, deterministic solution and the availability of the cost function as an indicator of the quality of the solution. In contrast, Bayesian algorithms based on sampling and voting are more flexible and enjoy the so-called anytime property, providing a relatively good solution at any point in time. Often, however, they suffer from the computational costs of sampling the Bayesian posterior.

In this paper we present the Bayes point machine as an approximation to Bayesian inference for linear classifiers in kernel space. In contrast to the Gaussian process viewpoint we do not define a Gaussian prior on the length $\|\mathbf{w}\|$ of the weight vector. Instead, we only consider weight vectors of length $\|\mathbf{w}\| = 1$ because it is only the spatial direction of the weight vector that matters for classification. It is then natural to define a uniform prior on the resulting ball-shaped hypothesis space. Hence, we determine the centre of mass of the resulting posterior that is uniform in version space, i.e. in the zero training error region. It should be kept in mind that the centre of mass is merely an approximation to the real Bayes point from which the name of the algorithm was derived. In order to estimate the centre of mass we suggest both a dynamic system called a kernel billiard and an approximative method that uses the perceptron algorithm trained on permutations of the training sample. The latter method proves to be efficient enough to make the Bayes point machine applicable to large data sets.

An additional insight into the usefulness of the centre of mass comes from the statistical mechanics approach to neural computing where the generalisation error for Bayesian learning algorithms has been calculated for the case of randomly constructed and unbiased patterns $\mathbf{x}$ (Opper and Haussler, 1991). Thus if $\zeta$ is the number of training examples per weight and $\zeta$ is large, the generalisation error of the centre of mass scales as $0.44/\zeta$ whereas scaling with $\zeta$ is poorer for the solutions found by the linear support vector machine (scales as $0.50/\zeta$; see Opper and Kinzel, 1995), Adaline (scales as $0.24/\sqrt{\zeta}$; see Opper et al., 1990) and other approaches.

Of course many of the viewpoints and algorithms presented in this paper are based on extensive previous work carried out by numerous authors in the past. In particular it seems worthwhile to mention that linear classifiers have been studied intensively in two rather distinct communities: the machine learning community and the statistical physics community. While it is beyond the scope of this paper to review the entire history of the field we would like to emphasise that our geometrical viewpoint as expressed later in the paper has been inspired by the very original paper "Playing billiard in version space" by P. Ruján (Ruján, 1997). Also, in that paper the term Bayes point was coined and the idea of using a billiard-like dynamical system for uniform sampling was introduced. Both we (Herbrich et al., 1999a,b, 2000a) and Ruján and Marchand (2000) independently generalised the algorithm to be applicable in kernel space. Finally, following a theoretical suggestion of Watkin (1993) we were able to scale up the Bayes point algorithm to large data sets by using different perceptron solutions from permutations of the training sample.

The paper is structured as follows: In the following section we review the basic ideas of Bayesian inference with a particular focus on classification learning. Along with a discussion about the optimality of the Bayes classification strategy we show that for the special case of linear classifiers in feature space the centre of mass of all consistent classifiers is arbitrarily close to the Bayes point (with increasing training sample size) and can be efficiently estimated in the linear span of the training data. Moreover, we give a geometrical picture of support vector learning in feature space which reveals that the support vector machine can be viewed as an approximation to the Bayes point machine. In Section 3 we present two algorithms for the estimation of the centre of mass of version space: one exact method and an approximate method tailored for large training samples. An extensive list of experimental results is presented in Section 4, both on small machine learning benchmark datasets as well as on large scale datasets from the field of handwritten digit recognition. In Section 5 we summarise the results and discuss some theoretical extensions of the method presented. In order to unburden the main text, the lengthy proofs as well as the pseudocode have been relegated to the appendix.

We denote $n$-tuples by italic bold letters (e.g. $\boldsymbol{x} = (x_1,\ldots,x_n)$), vectors by roman bold letters (e.g. $\mathbf{x}$), random variables by sans serif font (e.g. $\mathsf{X}$) and vector spaces by calligraphic capitalised letters (e.g. $\mathcal{X}$). The symbols $\mathbf{P}$, $\mathbf{E}$ and $\mathbf{I}$ denote a probability measure, the expectation of a random variable and the indicator function, respectively.
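2 A Bayesian Consideration of Learning

In this section we would like to revisit the Bayesian approach to learning (see Buntine, 1992; MacKay, 1991; Neal, 1996; Bishop, 1995, for a more detailed treatment). Suppose we are given a training sample $\boldsymbol{z} = (\boldsymbol{x},\boldsymbol{y}) = ((x_1,y_1),\ldots,(x_m,y_m)) \in (\mathcal{X}\times\mathcal{Y})^m$ of size $m$ drawn iid from an unknown distribution $\mathbf{P}_{\mathsf{Z}} = \mathbf{P}_{\mathsf{XY}}$. Furthermore, assume we are given a fixed set $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ of functions $h : \mathcal{X} \to \mathcal{Y}$ referred to as hypothesis space. The task of learning is then to find the function $h^*$ which performs best on new yet unseen patterns $z = (x,y)$ drawn according to $\mathbf{P}_{\mathsf{XY}}$.

Definition 1 (Learning Algorithm) A (deterministic) learning algorithm $\mathcal{A} : \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{Y}^{\mathcal{X}}$ is a mapping from training samples $\boldsymbol{z}$ of arbitrary size $m \in \mathbb{N}$ to functions from $\mathcal{X}$ to $\mathcal{Y}$. The image of $\mathcal{A}$, i.e. $\{\mathcal{A}(\boldsymbol{z}) \mid \boldsymbol{z} \in \mathcal{Z}^m\} \subseteq \mathcal{Y}^{\mathcal{X}}$, is called the effective hypothesis space $\mathcal{H}_{\mathcal{A},m}$ of the learning algorithm $\mathcal{A}$ for the training sample size $m \in \mathbb{N}$. If there exists a hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ such that for every training sample size $m \in \mathbb{N}$ we have $\mathcal{H}_{\mathcal{A},m} \subseteq \mathcal{H}$ we shall omit the indices on $\mathcal{H}$.

In order to assess the quality of a function $h \in \mathcal{H}$ we assume the existence of a loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$. The loss $l(y,y') \in \mathbb{R}^+$ is understood to measure the incurred cost when predicting $y$ while the true output was $y'$. Hence we always assume that for all $y \in \mathcal{Y}$, $l(y,y) = 0$. A typical loss function for classification is the so-called zero-one loss $l_{0-1}$ defined as follows.

Definition 2 (Zero-One Loss) Given a fixed output space $\mathcal{Y}$, the zero-one loss is defined by

$$l_{0-1}(y,y') := \mathbf{I}_{y \neq y'}.$$

Based on the concept of a loss $l$, let us introduce several quality measures for hypotheses $h \in \mathcal{H}$.

Definition 3 (Generalisation and Training Error) Given a probability measure $\mathbf{P}_{\mathsf{XY}}$ and a loss $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ the generalisation error $R[h]$ of a function $h : \mathcal{X} \to \mathcal{Y}$ is defined by

$$R[h] := \mathbf{E}_{\mathsf{XY}}[l(h(\mathsf{X}),\mathsf{Y})].$$

Given a training sample $\boldsymbol{z} = (\boldsymbol{x},\boldsymbol{y}) \in (\mathcal{X}\times\mathcal{Y})^m$ of size $m$ and a loss $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ the training error $R_{\mathrm{emp}}[h,\boldsymbol{z}]$ of a function $h : \mathcal{X} \to \mathcal{Y}$ is given by

$$R_{\mathrm{emp}}[h,\boldsymbol{z}] := \frac{1}{m}\sum_{i=1}^{m} l(h(x_i),y_i).$$

Clearly, only the generalisation error $R[h]$ is appropriate to capture the performance of a fixed classifier $h \in \mathcal{H}$ on new patterns $z = (x,y)$. Nonetheless, we shall see that the training error plays a crucial role as it provides an estimate of the generalisation error based on the training sample.

Definition 4 (Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm $\mathcal{A} : \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{Y}^{\mathcal{X}}$. Then for any fixed training sample size $m \in \mathbb{N}$ the generalisation error $R_m[\mathcal{A}]$ of $\mathcal{A}$ is defined by

$$R_m[\mathcal{A}] := \mathbf{E}_{\mathsf{Z}^m}[R[\mathcal{A}(\mathsf{Z})]],$$

that is, the expected generalisation error of the hypotheses found by the algorithm.

Note that for any loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ a small generalisation error $R_m[\mathcal{A}]$ of the algorithm $\mathcal{A}$ guarantees a small generalisation error for most randomly drawn training samples $\boldsymbol{z}$ because by Markov's inequality we have for $\varepsilon > 0$,

$$\mathbf{P}_{\mathsf{Z}^m}\left(R[\mathcal{A}(\mathsf{Z})] > \varepsilon\,\mathbf{E}_{\mathsf{Z}^m}[R[\mathcal{A}(\mathsf{Z})]]\right) \le \frac{1}{\varepsilon}.$$

Hence we can view $R_m[\mathcal{A}]$ also as a performance measure of the hypotheses that $\mathcal{A}$ returns for randomly drawn training samples $\boldsymbol{z}$. Finally, let us consider a probability measure $\mathbf{P}_{\mathsf{H}}$ over the space of all possible mappings from $\mathcal{X}$ to $\mathcal{Y}$. Then, the average generalisation error of a learning algorithm $\mathcal{A}$ is defined as follows.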
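Definition 5 (Average Generalisation Error of Algorithms) Suppose we are given a fixed learning algorithm $\mathcal{A} : \bigcup_{m=1}^{\infty} \mathcal{Z}^m \to \mathcal{Y}^{\mathcal{X}}$. Then for each fixed training sample size $m \in \mathbb{N}$ the average generalisation error $\bar{R}_m[\mathcal{A}]$ of $\mathcal{A}$ is defined by

$$\bar{R}_m[\mathcal{A}] := \mathbf{E}_{\mathsf{H}}\left[\mathbf{E}_{\mathsf{Z}^m|\mathsf{H}=h}\left[\mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}\left[l((\mathcal{A}(\mathsf{Z}))(x),\mathsf{Y})\right]\right]\right]\right], \qquad (1)$$

that is, the average performance of the solution of algorithm $\mathcal{A}$ learned over the random draw of training samples and target hypotheses.

The average generalisation error is the standard measure of performance of an algorithm $\mathcal{A}$ if we have little knowledge about the potential function $h^*$ that labels all our data, expressed via $\mathbf{P}_{\mathsf{H}}$. Then, the measure (1) averages out our ignorance about the unknown $h^*$, thus considering the performance of $\mathcal{A}$ on average. There is a noticeable relation between $\bar{R}_m[\mathcal{A}]$ and $R_m[\mathcal{A}]$ if we assume that, given a measure $\mathbf{P}_{\mathsf{H}}$, the conditional distribution of outputs $y$ given $x$ is governed by

$$\mathbf{P}_{\mathsf{Y}|\mathsf{X}=x}(y) = \mathbf{P}_{\mathsf{H}}(\mathsf{H}(x) = y). \qquad (2)$$

Under this condition we have that $\bar{R}_m[\mathcal{A}] = R_m[\mathcal{A}]$ (see footnote 1). This result, however, is not too surprising taking into account that under the assumption (2) the measure $\mathbf{P}_{\mathsf{H}}$ fully encodes the unknown relationship between inputs $x$ and outputs $y$.

Footnote 1: In fact, it already suffices to assume that $\mathbf{E}_{\mathsf{Y}|\mathsf{X}=x}[l(y,\mathsf{Y})] = \mathbf{E}_{\mathsf{H}}[l(y,\mathsf{H}(x))]$, i.e. the prior correctly models the conditional distribution of the classes as far as the fixed loss is concerned.

2.1 The Bayesian Solution

In the Bayesian framework we are not simply interested in $h^* := \operatorname{argmin}_{h \in \mathcal{H}} R[h]$ itself but in our knowledge or belief in $h^*$. To this end, Bayesians use the concept of prior and posterior belief, i.e. the knowledge of $h^*$ before having seen any data and after having seen the data, which in the current case is our training sample $\boldsymbol{z}$. It is well known that under consistency rules known as Cox's axioms (Cox, 1946) beliefs can be mapped onto probability measures $\mathbf{P}_{\mathsf{H}}$. Under these rather plausible conditions the only consistent way to transfer prior belief $\mathbf{P}_{\mathsf{H}}$ into posterior belief $\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}$ is therefore given by Bayes' theorem:

$$\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}(h) = \frac{\mathbf{P}_{\mathsf{Z}^m|\mathsf{H}=h}(\boldsymbol{z})}{\mathbf{E}_{\mathsf{H}}\left[\mathbf{P}_{\mathsf{Z}^m|\mathsf{H}=h}(\boldsymbol{z})\right]}\,\mathbf{P}_{\mathsf{H}}(h) = \frac{\mathbf{P}_{\mathsf{Y}^m|\mathsf{X}^m=\boldsymbol{x},\mathsf{H}=h}(\boldsymbol{y})}{\mathbf{E}_{\mathsf{H}}\left[\mathbf{P}_{\mathsf{Y}^m|\mathsf{X}^m=\boldsymbol{x},\mathsf{H}=h}(\boldsymbol{y})\right]}\,\mathbf{P}_{\mathsf{H}}(h). \qquad (3)$$

The second expression is obtained by noticing that

$$\mathbf{P}_{\mathsf{Z}^m|\mathsf{H}=h}(\boldsymbol{z}) = \mathbf{P}_{\mathsf{Y}^m|\mathsf{X}^m=\boldsymbol{x},\mathsf{H}=h}(\boldsymbol{y})\,\mathbf{P}_{\mathsf{X}^m|\mathsf{H}=h}(\boldsymbol{x}) = \mathbf{P}_{\mathsf{Y}^m|\mathsf{X}^m=\boldsymbol{x},\mathsf{H}=h}(\boldsymbol{y})\,\mathbf{P}_{\mathsf{X}^m}(\boldsymbol{x})$$

because hypotheses do not have an influence on the generation of patterns. Based on a given loss function $l$ we can further decompose the first term of the numerator of (3), known as the likelihood of $h$. Let us assume that the probability of a class $y$ given an instance $x$ and an hypothesis $h$ is inversely proportional to the exponential of the loss incurred by $h$ on $x$. Thus we obtain

$$\mathbf{P}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}(y) = \frac{\exp(-\beta\, l(h(x),y))}{\sum_{y' \in \mathcal{Y}} \exp(-\beta\, l(h(x),y'))} = \frac{\exp(-\beta\, l(h(x),y))}{C(x)} = \begin{cases} \dfrac{1}{1+\exp(-\beta)} & \text{if } l(h(x),y) := l_{0-1}(h(x),y) = 0 \\[2ex] \dfrac{\exp(-\beta)}{1+\exp(-\beta)} & \text{if } l(h(x),y) := l_{0-1}(h(x),y) = 1, \end{cases} \qquad (4)$$

where $C(x)$ is a normalisation constant which in the case of the zero-one loss $l_{0-1}$ is independent of $x$ (see footnote 2) and $\beta$ controls the assumed level of noise. Note that the loss used in the exponentiated loss likelihood function is not to be confused with the decision-theoretic loss used in the Bayesian framework, which is introduced only after a posterior has been obtained in order to reach a risk-optimal decision.

Footnote 2: Note that for loss functions with real-valued arguments this need not be the case, which makes a normalisation independent of $x$ quite intricate (see Sollich, 2000, for a detailed treatment).

Definition 6 (PAC Likelihood) Suppose we are given an arbitrary loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$. Then, we call the function

$$\mathbf{P}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}(y) := \mathbf{I}_{y = h(x)} \qquad (5)$$

of $h$ the PAC likelihood for $h$.

Note that (5) is the limiting case of (4) for $\beta \to \infty$. Assuming the PAC likelihood it immediately follows that for any prior belief $\mathbf{P}_{\mathsf{H}}$ the posterior belief $\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}$ simplifies to

$$\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}(h) = \begin{cases} \dfrac{\mathbf{P}_{\mathsf{H}}(h)}{\mathbf{P}_{\mathsf{H}}(V(\boldsymbol{z}))} & \text{if } h \in V(\boldsymbol{z}) \\[2ex] 0 & \text{if } h \notin V(\boldsymbol{z}), \end{cases} \qquad (6)$$

where the version space $V(\boldsymbol{z})$ is defined as follows (see Mitchell, 1977, 1982).

Definition 7 (Version Space) Given an hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ and a training sample $\boldsymbol{z} = (\boldsymbol{x},\boldsymbol{y}) \in (\mathcal{X}\times\mathcal{Y})^m$ of size $m \in \mathbb{N}$ the version space $V(\boldsymbol{z}) \subseteq \mathcal{H}$ is defined by

$$V(\boldsymbol{z}) := \{h \in \mathcal{H} \mid \forall i \in \{1,\ldots,m\} : h(x_i) = y_i\}.$$

Since all information contained in the training sample $\boldsymbol{z}$ is used to update the prior $\mathbf{P}_{\mathsf{H}}$ by equation (3), all that will be used to classify a novel test point $x$ is the posterior belief $\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}$.

2.2 The Bayes Classification Strategy

In order to classify a new test point $x$, for each class $y$ the Bayes classification strategy (see footnote 3) determines the loss incurred by each hypothesis $h \in \mathcal{H}$ applied to $x$ and weights it according to its posterior probability $\mathbf{P}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}(h)$. The final decision is made for the class $y \in \mathcal{Y}$ that achieves the minimum expected loss, i.e.

$$\mathrm{Bayes}_{\boldsymbol{z}}(x) := \operatorname*{argmin}_{y \in \mathcal{Y}} \mathbf{E}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}[l(\mathsf{H}(x),y)]. \qquad (7)$$

This strategy has the following appealing property.

Footnote 3: The reason we do not call this mapping from $\mathcal{X}$ to $\mathcal{Y}$ a classifier is that the resulting mapping is (in general) not within the hypothesis space considered beforehand.

Theorem 8 (Optimality of the Bayes Classification Strategy) Suppose we are given a fixed hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$. Then, for any training sample size $m \in \mathbb{N}$, for any symmetric loss $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$, for any two measures $\mathbf{P}_{\mathsf{H}}$ and $\mathbf{P}_{\mathsf{X}}$, among all learning algorithms the Bayes classification strategy $\mathrm{Bayes}_{\boldsymbol{z}}$ given by (7) minimises the average generalisation error $\bar{R}_m[\mathrm{Bayes}_{\boldsymbol{z}}]$ under the assumption that for each $h$ with $\mathbf{P}_{\mathsf{H}}(h) > 0$

$$\forall y \in \mathcal{Y} : \forall x \in \mathcal{X} : \quad \mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}[l(y,\mathsf{Y})] = l(y,h(x)). \qquad (8)$$

Proof. Let us consider a fixed learning algorithm $\mathcal{A}$. Then it holds true that

$$\begin{aligned} \bar{R}_m[\mathcal{A}] &= \mathbf{E}_{\mathsf{H}}\left[\mathbf{E}_{\mathsf{Z}^m|\mathsf{H}=h}\left[\mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}\left[l((\mathcal{A}(\mathsf{Z}))(x),\mathsf{Y})\right]\right]\right]\right] \\ &= \mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{H}}\left[\mathbf{E}_{\mathsf{Z}^m|\mathsf{H}=h}\left[\mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}\left[l((\mathcal{A}(\mathsf{Z}))(x),\mathsf{Y})\right]\right]\right]\right] \\ &= \mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{Z}^m}\left[\mathbf{E}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}\left[\mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}\left[l((\mathcal{A}(\mathsf{Z}))(x),\mathsf{Y})\right]\right]\right]\right] \\ &= \mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{Z}^m}\left[\mathbf{E}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}\left[l((\mathcal{A}(\mathsf{Z}))(\mathsf{X}),\mathsf{H}(\mathsf{X}))\right]\right]\right], \end{aligned} \qquad (9)$$

where we exchanged the order of expectations over $\mathsf{X}$ in the second line, applied the theorem of repeated integrals (see, e.g. Feller, 1966) in the third line and finally used (8) in the last line. Using the symmetry of the loss function, the innermost expression of (9) is minimised by the Bayes classification strategy (7) for any possible training sample $\boldsymbol{z}$ and any possible test point $x$. Hence, (7) minimises the whole expression, which proves the theorem.

In order to enhance the understanding of this result let us consider the simple case of $l = l_{0-1}$ and $\mathcal{Y} = \{-1,+1\}$. Then, given a particular classifier $h \in \mathcal{H}$ having non-zero prior probability $\mathbf{P}_{\mathsf{H}}(h) > 0$, by assumption (8) we require that the conditional distribution of classes $y$ given $x$ is delta-peaked at $h(x)$ because

$$\begin{aligned} \mathbf{E}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}[l_{0-1}(y,\mathsf{Y})] &= l_{0-1}(y,h(x)), \\ \mathbf{P}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}(-y) &= \mathbf{I}_{y \neq h(x)}, \\ \mathbf{P}_{\mathsf{Y}|\mathsf{X}=x,\mathsf{H}=h}(y) &= \mathbf{I}_{h(x) = y}. \end{aligned}$$

Although for a fixed $h \in \mathcal{H}$ drawn according to $\mathbf{P}_{\mathsf{H}}$ we do not know that $\mathrm{Bayes}_{\boldsymbol{z}}$ achieves the smallest generalisation error $R[\mathrm{Bayes}_{\boldsymbol{z}}]$, we can guarantee that on average over the random draw of $h$'s the Bayes classification strategy is superior. In fact, the optimal classifier for a fixed $h \in \mathcal{H}$ is simply $h$ itself (see footnote 4) and in general $\mathrm{Bayes}_{\boldsymbol{z}}(x) \neq h(x)$ for at least a few $x \in \mathcal{X}$.

2.3 The Bayes Point Algorithm

Although the Bayes classification strategy is on average the optimal strategy to perform when given a limited amount of training data $\boldsymbol{z}$, it is computationally very demanding as it requires the evaluation of $\mathbf{E}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}[l(\mathsf{H}(x),y)]$ for each possible $y$ at each new test point $x$ (Graepel et al., 2000). The problem arises because the Bayes classification strategy does not correspond to any one single classifier $h \in \mathcal{H}$. One way to tackle this problem is to require the classifier $\mathcal{A}(\boldsymbol{z})$ learned from any training sample $\boldsymbol{z}$ to lie within a fixed hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ containing functions $h \in \mathcal{H}$ whose evaluation at a particular test point $x$ can be carried out efficiently. Thus if it is additionally required to limit the possible solution of a learning algorithm to a given hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, we can in general only hope to approximate $\mathrm{Bayes}_{\boldsymbol{z}}$.

Definition 9 (Bayes Point Algorithm) Suppose we are given a fixed hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ and a fixed loss $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$. Then, for any two measures $\mathbf{P}_{\mathsf{X}}$ and $\mathbf{P}_{\mathsf{H}}$, the Bayes point algorithm $\mathcal{A}_{\mathrm{bp}}$ is given by

$$\mathcal{A}_{\mathrm{bp}}(\boldsymbol{z}) := \operatorname*{argmin}_{h \in \mathcal{H}} \mathbf{E}_{\mathsf{X}}\left[\mathbf{E}_{\mathsf{H}|\mathsf{Z}^m=\boldsymbol{z}}\left[l(h(\mathsf{X}),\mathsf{H}(\mathsf{X}))\right]\right],$$

that is, for each training sample $\boldsymbol{z} \in \mathcal{Z}^m$ the Bayes point algorithm chooses the classifier $h_{\mathrm{bp}} := \mathcal{A}_{\mathrm{bp}}(\boldsymbol{z}) \in \mathcal{H}$ that mimics best the Bayes classification strategy (7) on average over randomly drawn test points. The classifier $\mathcal{A}_{\mathrm{bp}}(\boldsymbol{z})$ is called the Bayes point.

Assuming the correctness of the model given by (8) we furthermore remark that the Bayes point algorithm $\mathcal{A}_{\mathrm{bp}}$ is the best approximation to the Bayes classification strategy (7) in terms of the average generalisation error, i.e. measuring the distance of the learning algorithm $\mathcal{A}$ for $\mathcal{H}$ using the distance $\|\mathcal{A} - \mathrm{Bayes}\| = \bar{R}_m[\mathcal{A}] - \bar{R}_m[\mathrm{Bayes}]$. In this sense, for a fixed training sample $\boldsymbol{z}$ we can view the Bayes point $h_{\mathrm{bp}}$ as a projection of $\mathrm{Bayes}_{\boldsymbol{z}}$ into the hypothesis space $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$. The difficulty with the Bayes point algorithm, however, is the need to know the input distribution $\mathbf{P}_{\mathsf{X}}$ for the determination of the hypothesis learned from $\boldsymbol{z}$. This somehow limits the applicability of the algorithm as opposed to the Bayes classification strategy, which requires only broad prior knowledge about the underlying relationship expressed via some prior belief $\mathbf{P}_{\mathsf{H}}$.

Footnote 4: It is worthwhile mentioning that the only information to be used in any classification strategy is the training sample $\boldsymbol{z}$ and the prior $\mathbf{P}_{\mathsf{H}}$. Hence it is impossible to detect which classifier $h \in \mathcal{H}$ labels a fixed tuple $\boldsymbol{x}$ only on the basis of the $m$ labels $\boldsymbol{y}$ observed on the training sample. Thus, although we might be lucky in guessing $h$ for a fixed $h \in \mathcal{H}$ and $\boldsymbol{z} \in \mathcal{Z}^m$, we cannot do better than the Bayes classification strategy $\mathrm{Bayes}_{\boldsymbol{z}}$ when considering the average performance, the average being taken over the random choice of the classifiers and the training samples $\boldsymbol{z}$.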
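The relation between the Bayes classification strategy (7) and the Bayes point of Definition 9 can be made concrete on a toy problem. The following is a minimal sketch, not taken from the paper: it assumes a finite input space $\{0,\ldots,n-1\}$, a hypothetical hypothesis space of threshold classifiers, a uniform prior, the PAC likelihood (5), the zero-one loss and a non-empty version space. All function names are illustrative.

```python
def threshold_hypotheses(n_points):
    """Hypothetical hypothesis space: h_t(x) = +1 if x >= t else -1, t = 0..n_points.
    Each hypothesis is stored as the tuple of its labels on x = 0..n_points-1."""
    return [tuple(1 if x >= t else -1 for x in range(n_points))
            for t in range(n_points + 1)]


def pac_posterior(hyps, prior, sample):
    """Posterior (6): the prior restricted to version space and renormalised.
    Assumes at least one hypothesis is consistent with the sample."""
    weights = [p if all(h[x] == y for x, y in sample) else 0.0
               for h, p in zip(hyps, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def bayes_strategy(hyps, post, n_points):
    """Strategy (7) under zero-one loss: at each x pick the label with the
    smaller posterior-expected loss (ties are broken towards +1)."""
    preds = []
    for x in range(n_points):
        loss_plus = sum(p for h, p in zip(hyps, post) if h[x] != 1)
        loss_minus = sum(p for h, p in zip(hyps, post) if h[x] != -1)
        preds.append(1 if loss_plus <= loss_minus else -1)
    return tuple(preds)


def bayes_point(hyps, post, p_x):
    """Definition 9: the single hypothesis with the least expected disagreement
    with a posterior draw, averaged over the input distribution p_x."""
    def expected_disagreement(h):
        return sum(px * sum(p for g, p in zip(hyps, post) if g[x] != h[x])
                   for x, px in enumerate(p_x))
    scores = [expected_disagreement(h) for h in hyps]
    best = min(range(len(hyps)), key=scores.__getitem__)
    return best, scores


n = 6
hyps = threshold_hypotheses(n)
prior = [1.0 / len(hyps)] * len(hyps)
post = pac_posterior(hyps, prior, [(0, -1), (5, 1)])   # two labelled points
vote = bayes_strategy(hyps, post, n)
best, scores = bayes_point(hyps, post, [1.0 / n] * n)  # uniform P_X assumed
```

In this particular toy setting the posterior vote happens to coincide with a member of the hypothesis space, so `vote == hyps[best]`; in general the Bayes strategy need not lie in $\mathcal{H}$, which is exactly why the Bayes point is defined as a projection.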
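2.3.1 The Bayes Point for Linear Classifiers

We now turn our attention to the special case of linear classifiers where we assume that $N$ measurements of the objects $x$ are taken by features $\phi_i : \mathcal{X} \to \mathbb{R}$ thus forming a (vectorial) feature map $\phi : \mathcal{X} \to \mathcal{K} \subseteq \ell_2^N$, $\phi(x) = (\phi_1(x),\ldots,\phi_N(x))$. Note that by this formulation the special case of vectorial objects $\mathbf{x}$ is automatically taken care of by the identity map $\phi(\mathbf{x}) = \mathbf{x}$. For notational convenience we use the shorthand notation $\mathbf{x}$ for $\phi(x)$ (see footnote 5) such that $\langle \mathbf{x},\mathbf{w} \rangle := \sum_{i=1}^{N} \phi_i(x) w_i$. Hence, for a fixed mapping $\phi$ the hypothesis space is given by

$$\mathcal{H} := \{x \mapsto \operatorname{sign}(\langle \mathbf{x},\mathbf{w} \rangle) \mid \mathbf{w} \in \mathcal{W}\}, \qquad \mathcal{W} := \{\mathbf{w} \in \mathcal{K} \mid \|\mathbf{w}\| = 1\}. \qquad (10)$$

Footnote 5: This should not be confused with $\boldsymbol{x}$, which denotes the sample $(x_1,\ldots,x_m)$ of training objects.

As each hypothesis $h_{\mathbf{w}}$ is uniquely defined by its weight vector $\mathbf{w}$ we shall in the following consider prior beliefs $\mathbf{P}_{\mathsf{W}}$ over $\mathcal{W}$, i.e. possible weight vectors (of unit length), in place of priors $\mathbf{P}_{\mathsf{H}}$. By construction, the output space is $\mathcal{Y} = \{-1,+1\}$ and we furthermore consider the special case of $l = l_{0-1}$ as defined by Definition 2. If we assume that the input distribution is spherically Gaussian in the feature space $\mathcal{K}$ of dimensionality $d = \dim(\mathcal{K})$, i.e.

$$f_{\mathsf{X}}(\mathbf{x}) = \pi^{-\frac{d}{2}} \exp\left(-\|\mathbf{x}\|^2\right), \qquad (11)$$

then we find that the centre of mass

$$\mathbf{w}_{\mathrm{cm}} = \frac{\mathbf{E}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}[\mathsf{W}]}{\left\|\mathbf{E}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}[\mathsf{W}]\right\|} \qquad (12)$$

is a very good approximation to the Bayes point $\mathbf{w}_{\mathrm{bp}}$ and converges towards $\mathbf{w}_{\mathrm{bp}}$ if the posterior belief $\mathbf{P}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}$ becomes sharply peaked (for a similar result see Watkin, 1993).

Theorem 10 (Optimality of the Centre of Mass) Suppose we are given a fixed mapping $\phi : \mathcal{X} \to \mathcal{K} \subseteq \ell_2^N$. Then, for all $m \in \mathbb{N}$, if $\mathbf{P}_{\mathsf{X}}$ possesses the density (11) and the prior belief is correct, i.e. (8) is valid, the average generalisation error of the centre of mass as given by (12) always fulfils

$$\bar{R}_m[\mathcal{A}_{\mathrm{cm}}] - \bar{R}_m[\mathcal{A}_{\mathrm{bp}}] \le \mathbf{E}_{\mathsf{Z}^m}[\kappa(\varepsilon(\mathsf{Z}))],$$

where $\kappa(\varepsilon)$ is defined piecewise as { arccos(ε) […] π […] ε 2 […] if ε < […]23; […] otherwise } and

$$\varepsilon(\boldsymbol{z}) := \min_{\mathbf{w} : \mathbf{P}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}(\mathbf{w}) > 0} \langle \mathbf{w}_{\mathrm{cm}},\mathbf{w} \rangle.$$

The lengthy proof of this theorem is given in Appendix A.1. The interesting fact to note about this result is that $\lim_{\varepsilon \to 1} \kappa(\varepsilon) = 0$ and thus whenever the prior belief $\mathbf{P}_{\mathsf{W}}$ is not vanishing for some $\mathbf{w}$, $\lim_{m \to \infty} \mathbf{E}_{\mathsf{Z}^m}[\kappa(\varepsilon(\mathsf{Z}))] = 0$, because for increasing training sample size the posterior is sharply peaked at the weight vector labelling the data (see footnote 6). This shows that for increasing training sample size the centre of mass (under the posterior $\mathbf{P}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}$) is a good approximation to the optimal projection of the Bayes classification strategy, the Bayes point. Henceforth, any algorithm which aims at returning the centre of mass under the posterior $\mathbf{P}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}$ is called a Bayes point machine. Note that in the case of the PAC likelihood as defined in Definition 6 the centre of mass under the posterior $\mathbf{P}_{\mathsf{W}|\mathsf{Z}^m=\boldsymbol{z}}$ coincides with the centre of mass of version space (see Definition 7).

Footnote 6: This result is a slight generalisation of the result in Watkin (1993), which only proved this to be true for the uniform prior $\mathbf{P}_{\mathsf{W}}$.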
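To illustrate what "centre of mass of version space" means for unit-length weight vectors, the sketch below estimates (12) by naive rejection sampling: draw directions uniformly on the unit sphere, keep those that classify every training point correctly (PAC likelihood, uniform prior), average them and renormalise. This is only an illustration under those assumptions and is not one of the two algorithms the paper proposes; rejection sampling becomes hopeless once version space is small. The helper names are hypothetical.

```python
import math
import random


def unit(v):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(p * q for p, q in zip(a, b))


def centre_of_mass(xs, ys, n_draws=20000, seed=0):
    """Monte Carlo estimate of the centre of mass of version space.

    xs: feature vectors, ys: labels in {-1, +1}.
    Returns the normalised mean of all sampled unit vectors w that satisfy
    y_i <x_i, w> > 0 for every i, together with the number of accepted draws."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = [0.0] * dim
    kept = 0
    for _ in range(n_draws):
        # A normalised standard Gaussian vector is uniform on the unit sphere.
        w = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
        if all(y * dot(x, w) > 0 for x, y in zip(xs, ys)):
            total = [t + c for t, c in zip(total, w)]
            kept += 1
    if kept == 0:
        raise ValueError("no sampled weight vector fell into version space")
    return unit(total), kept


# Two training points in the plane; version space is the arc of unit vectors
# with both coordinates positive, so the estimate should lie near its bisector.
xs = [(1.0, 0.0), (0.0, 1.0)]
ys = [1, 1]
w_cm, accepted = centre_of_mass(xs, ys)
```

The fraction `accepted / n_draws` also gives a rough estimate of the prior volume of version space, which shows why uniform sampling gets harder as the training sample grows.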
8 HERBRICH, GRAEPEL, CAMPBELL Ü Ü Û ¼ µ Û Ü Û Ü Û ¼ Figure : Shown is the argin a = γ x (w)= x,w under the assuption that w = x = At the sae tie, a (length of the dotted line) equals the distance of x fro the hyperplane {x x,w = } (dashed line) as well as the distance of the weight vector w fro the hyperplane {w x,w = } (dashed { line) Note, } however, that the Euclidean distance of w fro the separating boundary w W x,w = equals b(a) where b is a strictly onotonic function of its arguent 24 A (Pseudo) Bayesian Derivation of the Support Vector Machine In this section we would like to show that the well known support vector achine (Boser et al, 992; Cortes, 995; Vapnik, 995) can also be viewed as an approxiation to the centre of ass of version space V (z) in the noise free scenario, ie considering the PAC likelihood given in Definition 6, and additionally assuing that x i x : x i = φ(x i ) = const In order to see this let us recall that the support vector achine ais at axiising the argin γ z (w) of the weight vector w on the training saple z given by γ z (w) := in i {,,} y i x i,w } w {{ } γ xi (w) = in w y i x i,w, (3) i {,,} which for all w of unit length is erely the inial realvalued output (flipped to the correct sign) over the whole training saple In order to solve this proble algorithically one takes advantage of the fact that fixing the realvalued output to one (rather than the nor w of the weight vector w) renders the proble of finding the argin axiiser w SVM as a proble with a quadratic objective function ( w 2 = w w) under linear constraints (y i x i,w ), ie ( ) w SVM := argax w W in y i x i,w i {,,} argin w {v ini {,,} y i x i,v =} (4) ( w 2) (5) Note that the set of weight vectors in (5) are called the weight vectors of the canonical hyperplanes (see Vapnik, 998, p 42) and that this set is highly dependent on the given training saple Nonetheless, the solution to (5) is (up to scaling) equivalent to the solution of (4) a forulation uch ore aenable for 
theoretical studies Interestingly, however, the quantity γ xi (w) as iplicitly defined in (3) is not only the distance of the point y i x i fro the hyperplane having the noral w but also x i ties the Euclidean distance of the point w fro the hyperplane having the noral y i x i (see Figure ) Thus γ z (w) can be viewed as the radius of 252
9 BAYES POINT MACHINES the ball { v W w v b(γ z (w)) } that only contains weight vectors in version space V (z) Here, b : R + R + is a strictly onotonic function of its arguent and its effect is graphically depicted in Figure As a consequence thereof, axiising the argin γ z (w) over the choice of w returns the classifier w SVM that is the centre of the largest ball still inscribable in version space Note that the whole reasoning relied on the assuption that all training points x i have a constant nor in feature space K If this assuption is violated, each distance of a classifier w to the hyperplane having the noral y i x i is easured on a different scale and thus the points with the largest nor x i in feature space K have the highest influence on the resulting solution To circuvent this proble is has been suggested elsewhere that input vectors should be noralised in feature space before applying any kernel ethod in particular the support vector achine algorith (see Herbrich and Graepel, 2; Schölkopf et al, 999; Joachis, 998; Haussler, 999) Furtherore, all indices I SV {,,} at which the iniu y i x i,w SVM in (4) is attained are the ones for which y i x i,w = in the forulation (5) As the latter are called support vectors we see that the support vectors are the training points at which the largest inscribable ball touches the corresponding hyperplane { w W (yi x i,w = ) } 25 Applying the Kernel Trick When solving (5) over the possible choices of w W it is well known that the solution w SVM adits the following representation w SVM = α i x i, that is the solution to (5) ust live in the linear span of the training points This follows naturally fro the following theore (see also Schölkopf et al, 2) Theore (Representer Theore) Suppose we are given a fixed apping φ : X K l N 2, a training saple z =(x,y) Z, a cost function c : X Y R R { } strictly onotonically decreasing in the third arguent and the class of linear functions in K as given by () Then any w z W defined by w z 
:= argin c(x,y,( x,w,, x,w )) (6) w W adits a representation of the for i= α R : w z = i= α i x i (7) The proof is given in Appendix A2 In order to see that this theore applies to support vector achines note that (4) is equivalent to the iniiser of (6) when using c(x,y,( x,w,, x,w )) = in y i y y i x i,w, which is strictly onotonically decreasing in its third arguent A slightly ore difficult arguent is necessary to see that the centre of ass (2) can also be written as a iniiser of (6) using a[ specific cost function c At first we recall that the centre of ass has the property of iniising E W Z =z w W 2] over the choice of w W (see also (3)) Theore 2 (Sufficiency of the linear span) Suppose we are given a fixed apping φ : X K l N 2 Let us assue that P W is unifor and P Y X=x,W=w (y) = f (sign(y x,w )), ie the likelihood depends on the sign of the realvalued output y x,w of w Let L x := { i= α ix i α R } be the linear span of apped data points {x,,x } and W x := W L x Then for any training saple z Z and any w W w W v 2 dp W Z =z (v)=c w v 2 dp W Z W =z (v), (8) x 253
10 HERBRICH, GRAEPEL, CAMPBELL that is, up to a constant C R + that is independent of w it suffices to consider vectors of unit length in the linear span of the apped training points {x,,x } The proof is given in Appendix A3 An iediate consequence of this theore is the fact that we only need to consider the diensional sphere W x in order to find the centre of ass under the assuption of a unifor prior P W Hence a loss function c such that (6) finds the centre of ass is given by ( ) c(x,y,( x,w,, x,w )) = 2 α i x i,w dp A Z R =(x,y) where P A Z =z is only nonzero for vectors α such that i= α ix i = and is independent of w The treendous advantage of a representation of the solution w z by (7) becoes apparent when considering the realvalued output of a classifier at any given data point (either training or test point) w z,x = α i x i,x = α i x i,x = α i k (x i,x) i= i= i= Clearly, all that is needed in the feature space K is the inner product function k (x, x) := φ(x),φ( x) (9) Reversing the chain of arguents indicates how the kernel trick ay be used to find an efficient ipleentation We fix a syetric function k : X X R called kernel and show that there exists a feature apping φ k : X K l N 2 such that (9) is valid for all x, x X A sufficient condition for k being a valid inner product function is given by Mercer s theore (see Mercer, 99) In a nutshell, whenever the evaluation of k at any given saple (x,,x ) results in a positive seidefinite atrix G ij := k (x i,x j ) then k is a so called Mercer kernel The atrix G is called the Gra atrix and is the only quantity needed in support vector and Bayes point achine learning For further details on the kernel trick the reader is referred to Schölkopf et al (999); Cristianini and ShaweTaylor (2); Wahba (99); Vapnik (998) 3 Estiating the Bayes Point in Feature Space In order to estiate the Bayes point in feature space K we consider a Monte Carlo ethod, ie instead of exactly coputing the expectation (2) we approxiate it by an 
average over weight vectors w drawn according to P W Z =z and restricted to W x (see Theore 2) In the following we will restrict ourselves to the PAC likelihood given in (5) and P W being unifor on the unit sphere W K By this assuption we know that the posterior is unifor over version space (see (6)) In Figure 2 we plotted an exaple for the special case of N = 3 diensional feature space K It is, however, already very difficult to saple uniforly fro version space V (z) as this set of points lives on a convex polyhedron on the unit sphere in 7 W x In the following two subsections we present two ethods to achieve this sapling The first ethod develops on an idea of Ruján (997) (later followed up by a kernel version of the algorith in Ruján and Marchand, 2) that is based on the idea of playing billiards in version space V (z), ie after entering the version space with a very siple learning algorith such as the kernel perceptron (see Algorith ) the classifier w isconsidered as a billiardball and isbounced fora while within the convex polyhedron V (z) If this billiard is ergodic with respect to the unifor distribution over V (z), ie the travel tie of the billiard ball spent in a subset W V (z) is proportional to W V (z), then averaging over the trajectory of the billiard ball leads in the liit of an infinite nuber of bounces to the centre of ass of version space The second ethod presented tries to overcoe the large coputational deands of the billiard ethod by only approxiately achieving a unifor sapling of version space The idea is to use the perceptron 7 Note that by Theore 2 it suffices to saple fro the projection of the version space onto W x 254
learning algorithm in dual variables with different permutations $\Pi : \{1,\ldots,m\} \to \{1,\ldots,m\}$ so as to obtain different consistent classifiers $\mathbf{w}_i \in V(z)$ (see Watkin, 1993, for a similar idea). Obviously, the number of different samples obtained is finite and thus it is impossible to achieve exactness of the method in the limit of considering all permutations. Nevertheless, we shall demonstrate that, in particular for the task of handwritten digit recognition, the achieved performances are comparable to state-of-the-art learning algorithms.

Figure 2: Plot of a version space (convex polyhedron containing the black dot) $V(z)$ in a 3 dimensional feature space $K$. Each hyperplane is defined by a training example via its normal vector $y_i \mathbf{x}_i$.

Finally, we would like to remark that other efficient methods to estimate the Bayes point directly have recently been presented (Rychetsky et al., 2000; Minka, 2001). The main idea in Rychetsky et al. (2000) is to work out all corners $\mathbf{w}_i$ of version space and average over them in order to approximate the centre of mass of version space. Note that there are exactly $m$ corners because the $i$-th corner $\mathbf{w}_i$ satisfies $\langle \mathbf{x}_j,\mathbf{w}_i\rangle = 0$ for all $j \neq i$ and $y_i \langle \mathbf{x}_i,\mathbf{w}_i\rangle > 0$. If $\mathbf{X} = (\mathbf{x}_1,\ldots,\mathbf{x}_m)^\top$ is the $m \times N$ matrix of mapped training points $x = (x_1,\ldots,x_m)$ flipped to their correct side and we use the approach (17) for $\mathbf{w}$, this simplifies to

$$\mathbf{X}\mathbf{w}_i = \mathbf{X}\mathbf{X}^\top \boldsymbol{\alpha}_i = G\boldsymbol{\alpha}_i = (0,\ldots,0,y_i,0,\ldots,0)^\top =: y_i \mathbf{e}_i,$$

where the r.h.s. is the $i$-th unit vector multiplied by $y_i$. As a consequence, the expansion coefficients $\boldsymbol{\alpha}_i$ of the $i$-th corner $\mathbf{w}_i$ can easily be computed as $\boldsymbol{\alpha}_i = y_i G^{-1} \mathbf{e}_i$ and then need to be normalised such that $\|\mathbf{w}_i\| = 1$. The difficulty with this approach, however, is the fact that the inversion of the $m \times m$ Gram matrix $G$ is $O(m^3)$ and is thus as computationally complex as support vector learning while not enjoying the anytime property of a sampling scheme. The algorithm presented in Minka (2001, Chapter 5) (also see Opper and Winther, 2000, for an equivalent method) uses the idea of approximating the posterior measure $P_{W|Z^m=z}$ by a product of Gaussian densities
so that the centre of mass can be computed analytically. Although the approximation of the cut-off posterior over $P_{W|Z^m=z}$ resulting from the delta-peaked likelihood given in Definition 6 by Gaussian measures seems very crude at first glance, Minka could show that his method compares favourably to the results presented in this paper.

3.1 Playing Billiards in Version Space

In this subsection we present the billiard method to estimate the Bayes point, i.e. the centre of mass of version space when assuming a PAC likelihood and a uniform prior $P_W$ over weight vectors of unit length (the pseudo
code is given on page 275). By Theorem 2 each position $\mathbf{b}$ of the billiard ball and each estimate $\mathbf{w}_i$ of the centre of mass of $V(z)$ can be expressed as linear combinations of the mapped input points, i.e.

$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i \mathbf{x}_i, \qquad \mathbf{b} = \sum_{i=1}^{m} \gamma_i \mathbf{x}_i, \qquad \boldsymbol{\alpha},\boldsymbol{\gamma} \in \mathbb{R}^m.$$

Figure 3: Schematic view of the kernel billiard algorithm. Starting at $\mathbf{b}_0 \in V(z)$ a trajectory of billiard bounces $\mathbf{b}_1,\ldots,\mathbf{b}_5,\ldots$ is calculated and then averaged over so as to obtain an estimate $\hat{\mathbf{w}}_{cm}$ of the centre of mass of version space.

Without loss of generality we can make the following ansatz for the direction vector $\mathbf{v}$ of the billiard ball:

$$\mathbf{v} = \sum_{i=1}^{m} \beta_i \mathbf{x}_i, \qquad \boldsymbol{\beta} \in \mathbb{R}^m.$$

Using this notation, inner products and norms in feature space $K$ become

$$\langle \mathbf{b},\mathbf{v}\rangle = \sum_{i=1}^{m}\sum_{j=1}^{m} \gamma_i \beta_j k(x_i,x_j), \qquad \|\mathbf{b}\|^2 = \sum_{i,j=1}^{m} \gamma_i \gamma_j k(x_i,x_j), \qquad (20)$$

where $k : X \times X \to \mathbb{R}$ is a Mercer kernel and has to be chosen beforehand. At the beginning we assume that $\mathbf{w}_0 = \mathbf{0} \Leftrightarrow \boldsymbol{\alpha} = \mathbf{0}$. Before generating a billiard trajectory in version space $V(z)$ we first run any learning algorithm to find an initial starting point $\mathbf{b}_0$ inside the version space (e.g. support vector learning or the kernel perceptron (see Algorithm 1)). Then the kernel billiard algorithm consists of three steps (see also Figure 3):

1. Determine the closest boundary in direction $\mathbf{v}_i$ starting from the current position $\mathbf{b}_i$. Since it is computationally very demanding to calculate the flight time of the billiard ball on geodesics of the hypersphere $\mathcal{W}_x$ (see also Neal, 1997), we make use of the fact that the shortest distance in Euclidean space (if it exists) is also the shortest distance on the hypersphere $\mathcal{W}_x$. Thus, we have for the flight time $\tau_j$ of the billiard ball at position $\mathbf{b}_i$ in direction $\mathbf{v}_i$ to the hyperplane with normal vector $y_j \mathbf{x}_j$

$$\tau_j = -\frac{\langle \mathbf{b}_i,\mathbf{x}_j\rangle}{\langle \mathbf{v}_i,\mathbf{x}_j\rangle}. \qquad (21)$$

After calculating all flight times, we look for the smallest positive one, i.e.

$$c = \operatorname*{argmin}_{j \in \{i \,\mid\, \tau_i > 0\}} \tau_j.$$
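Determining the closest bounding hyperplane in Euclidean space rather than on geodesics causes problems if the surface of the hypersphere $\mathcal{W}_x$ is almost orthogonal to the direction vector $\mathbf{v}_i$, in which case $\tau_c \to \infty$. If this happens we randomly generate a direction vector $\mathbf{v}_i$ pointing towards the version space $V(z)$. Assuming that the last bounce took place at the hyperplane having normal $y_c \mathbf{x}_c$, this condition can easily be checked by

$$y_c \langle \mathbf{v}_i,\mathbf{x}_c\rangle > 0. \qquad (22)$$

Note that since the samples are taken from the bouncing points, the above procedure of dealing with the curvature of the hypersphere does not constitute an approximation but is exact. An alternative method of dealing with the problem of the curvature of the hypersphere $\mathcal{W}$ can be found in Minka (2001, Section 5.8).

2. Update the billiard ball's position to $\mathbf{b}_{i+1}$ and the new direction vector to $\mathbf{v}_{i+1}$. The new point $\mathbf{b}_{i+1}$ and the new direction $\mathbf{v}_{i+1}$ are calculated from

$$\mathbf{b}_{i+1} = \mathbf{b}_i + \tau_c \mathbf{v}_i, \qquad (23)$$

$$\mathbf{v}_{i+1} = \mathbf{v}_i - 2\,\frac{\langle \mathbf{v}_i,\mathbf{x}_c\rangle}{\|\mathbf{x}_c\|^2}\,\mathbf{x}_c. \qquad (24)$$

Afterwards the position $\mathbf{b}_{i+1}$ and the direction vector $\mathbf{v}_{i+1}$ need to be normalised. This is easily achieved by equation (20).

3. Update the centre of mass $\mathbf{w}_i$ of the whole trajectory by the new line segment from $\mathbf{b}_i$ to $\mathbf{b}_{i+1}$ calculated on the hypersphere $\mathcal{W}_x$. Since the solution $\mathbf{w}$ lies on the hypersphere $\mathcal{W}_x$ (see Theorem ) we cannot simply update the centre of mass using a weighted vector addition. Let us introduce the operation $\oplus_\mu$ acting on vectors of unit length. This function has to have the following properties:

$$\|\mathbf{s} \oplus_\mu \mathbf{t}\|^2 = 1, \qquad \|\mathbf{t} - \mathbf{s} \oplus_\mu \mathbf{t}\| = \mu \|\mathbf{t} - \mathbf{s}\|,$$

$$\mathbf{s} \oplus_\mu \mathbf{t} = \rho_1(\langle \mathbf{s},\mathbf{t}\rangle,\mu)\,\mathbf{s} + \rho_2(\langle \mathbf{s},\mathbf{t}\rangle,\mu)\,\mathbf{t}, \qquad \rho_1(\langle \mathbf{s},\mathbf{t}\rangle,\mu) \geq 0, \quad \rho_2(\langle \mathbf{s},\mathbf{t}\rangle,\mu) \geq 0.$$

This rather arcane definition implements a weighted addition of $\mathbf{s}$ and $\mathbf{t}$ such that $\mu$ is the fraction between the resulting chord length $\|\mathbf{t} - \mathbf{s} \oplus_\mu \mathbf{t}\|$ and the total chord length $\|\mathbf{t} - \mathbf{s}\|$. In Appendix A.4 it is shown that the following formulae for $\rho_1(\langle \mathbf{s},\mathbf{t}\rangle,\mu)$ and $\rho_2(\langle \mathbf{s},\mathbf{t}\rangle,\mu)$ implement such a weighted addition:

$$\rho_1(\langle \mathbf{s},\mathbf{t}\rangle,\mu) = \mu \sqrt{-\frac{\mu^2 - \mu^2 \langle \mathbf{s},\mathbf{t}\rangle - 2}{\langle \mathbf{s},\mathbf{t}\rangle + 1}},$$

$$\rho_2(\langle \mathbf{s},\mathbf{t}\rangle,\mu) = -\rho_1(\langle \mathbf{s},\mathbf{t}\rangle,\mu)\,\langle \mathbf{s},\mathbf{t}\rangle \pm \left(\mu^2 \left(1 - \langle \mathbf{s},\mathbf{t}\rangle\right) - 1\right).$$

By assuming a constant line density on the manifold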
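$V(z)$ the whole line between $\mathbf{b}_i$ and $\mathbf{b}_{i+1}$ can be represented by the midpoint $\mathbf{m}$ on the manifold $V(z)$ given by

$$\mathbf{m} = \frac{\mathbf{b}_i + \mathbf{b}_{i+1}}{\|\mathbf{b}_i + \mathbf{b}_{i+1}\|}.$$

Thus, one updates the centre of mass of the trajectory by

$$\mathbf{w}_{i+1} = \rho_1\!\left(\langle \mathbf{w}_i,\mathbf{m}\rangle, \frac{\Xi_i}{\Xi_i + \xi_i}\right)\mathbf{w}_i + \rho_2\!\left(\langle \mathbf{w}_i,\mathbf{m}\rangle, \frac{\Xi_i}{\Xi_i + \xi_i}\right)\mathbf{m},$$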
where $\xi_i = \|\mathbf{b}_i - \mathbf{b}_{i+1}\|$ is the length of the trajectory in the $i$-th step and $\Xi_i = \sum_{j=1}^{i} \xi_j$ is the accumulated length up to the $i$-th step. Note that the operation $\oplus_\mu$ is only an approximation to the addition operation we sought because an exact weighting would require the arc lengths rather than chord lengths. As a stopping criterion we suggest computing an upper bound on $\rho_2$, the weighting factor of the new part of the trajectory. If this value falls below a prespecified threshold (TOL) we stop the algorithm. Note that the increase in $\Xi_i$ will always lead to termination.

3.2 Large Scale Bayes Point Machines

Clearly, all we need for estimating the centre of mass of version space (2) is a set of unit length weight vectors $\mathbf{w}_i$ drawn uniformly from $V(z)$. In order to save computational resources it might be advantageous to achieve a uniform sample only approximately. The classical perceptron learning algorithm offers the possibility to obtain up to $m!$ different classifiers in version space simply by learning on different permutations of the training sample. Of course, due to the sparsity of the solution, the number of different classifiers obtained is usually considerably smaller. A classical theorem to be found in Novikoff (1962) guarantees the convergence of this procedure and furthermore provides an upper bound on the number $t$ of mistakes needed until convergence. More precisely, if there exists a classifier $\mathbf{w}_{SVM}$ with margin $\gamma_z(\mathbf{w}_{SVM}) > 0$ (see (3)) then the number of mistakes until convergence, which is an upper bound on the sparsity of the solution, is not more than $\varsigma^2 \gamma_z^{-2}(\mathbf{w}_{SVM})$, where $\varsigma$ is the smallest real number such that $\|\mathbf{x}_i\|_K \leq \varsigma$. The quantity $\gamma_z(\mathbf{w}_{SVM})$ is maximised for the solution $\mathbf{w}_{SVM}$ found by the support vector machine, and whenever the support vector machine is theoretically justified by results from learning theory (see Shawe-Taylor et al., 1998; Vapnik, 1998) the ratio $d = \varsigma^2 \gamma_z^{-2}(\mathbf{w}_{SVM})$ is considerably less than $m$, say $d \ll m$. Algorithmically, we can benefit from this sparsity by the
following "trick": since

$$\mathbf{w} = \sum_{i=1}^{m} \alpha_i \mathbf{x}_i$$

all we need to store is the $m$-dimensional vector $\boldsymbol{\alpha}$. Furthermore, we keep track of the $m$-dimensional vector $\mathbf{o}$ of real-valued outputs

$$o_i = \langle \mathbf{x}_i,\mathbf{w}_t\rangle = \sum_{j=1}^{m} \alpha_j k(x_i,x_j)$$

of the current solution at the $i$-th training point. By definition, in the beginning $\boldsymbol{\alpha} = \mathbf{o} = \mathbf{0}$. Now, if $o_i y_i \leq 0$ we update $\alpha_i$ by $\alpha_i + y_i$ and update $\mathbf{o}$ by $o_j \leftarrow o_j + y_i k(x_i,x_j)$, which requires only $m$ kernel calculations (the evaluation of the $i$-th row of the Gram matrix $G$). In summary, the memory requirement of this algorithm is $2m$ and the number of kernel calculations is not more than $d \cdot m$. As a consequence, the computational requirement of this algorithm is no more than the computational requirement for the evaluation of the margin $\gamma_z(\mathbf{w}_{SVM})$! We suggest using this efficient perceptron learning algorithm in order to obtain samples $\mathbf{w}_i$ for the computation of the centre of mass (2).

In order to investigate the usefulness of this approach experimentally, we compared the distribution of generalisation errors of samples obtained by perceptron learning on permuted training samples with samples obtained by a full Gibbs sampling (see Graepel and Herbrich, 2001, for details on the kernel Gibbs sampler). For computational reasons, we used only 88 training patterns and 453 test patterns of the classes and 2 from the MNIST data set (footnote 8). In Figure 4 (a) and (b) we plotted the distribution over random samples using the kernel (footnote 9)

$$k(x,\tilde{x}) = \left(\langle x,\tilde{x}\rangle + 1\right)^5. \qquad (25)$$

Using a quantile-quantile (QQ) plot technique we can compare both distributions in one graph (see Figure 4 (c)). These plots suggest that by simple permutation of the training sample we are able to obtain a sample of classifiers exhibiting a similar distribution of generalisation error to the one obtained by time-consuming Gibbs sampling.

Footnote 8: This data set is publicly available at
Footnote 9: We decided to use this kernel because it showed excellent generalisation performance when using the support vector machine.
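The permutation-based sampling described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`poly_kernel`, `kernel_perceptron`, `bayes_point_estimate`), the `max_epochs` guard, the number of samples and the seed are all hypothetical choices made for the example. Each run of the dual perceptron visits the training points in a different random order, the resulting coefficient vector is rescaled to unit length in feature space via $\|\mathbf{w}\|^2 = \boldsymbol{\alpha}^\top G \boldsymbol{\alpha}$, and the rescaled vectors are averaged.

```python
import random


def poly_kernel(x, z, degree=5):
    """Polynomial kernel (<x, z> + 1)^degree, the form of equation (25)."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree


def gram_matrix(xs, kernel):
    """Gram matrix G[i][j] = k(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]


def kernel_perceptron(G, ys, order, max_epochs=1000):
    """Dual perceptron that visits the training points in the given order.

    Returns alpha with y_i * sum_j alpha_j * G[i][j] > 0 for all i.
    max_epochs is an assumed safety limit, not part of the original text.
    """
    m = len(ys)
    alpha = [0.0] * m
    out = [0.0] * m  # cached outputs o_i = sum_j alpha_j * k(x_i, x_j)
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if ys[i] * out[i] <= 0:
                alpha[i] += ys[i]
                for j in range(m):  # needs only row i of the Gram matrix
                    out[j] += ys[i] * G[i][j]
                mistakes += 1
        if mistakes == 0:
            return alpha
    raise RuntimeError("no consistent classifier found within max_epochs")


def feature_space_norm(alpha, G):
    """||w|| = sqrt(alpha^T G alpha) for w = sum_i alpha_i x_i."""
    m = len(alpha)
    return sum(alpha[r] * alpha[s] * G[r][s]
               for r in range(m) for s in range(m)) ** 0.5


def bayes_point_estimate(G, ys, n_samples=20, seed=0):
    """Average unit-length perceptron solutions over random permutations."""
    rng = random.Random(seed)
    m = len(ys)
    total = [0.0] * m
    for _ in range(n_samples):
        order = list(range(m))
        rng.shuffle(order)
        alpha = kernel_perceptron(G, ys, order)
        scale = feature_space_norm(alpha, G)
        for j in range(m):
            total[j] += alpha[j] / scale
    return [t / n_samples for t in total]


def output(alpha, G, i):
    """Real-valued output of the classifier at training point i."""
    return sum(alpha[j] * G[i][j] for j in range(len(alpha)))
```

Because version space is the intersection of half-spaces with the unit sphere, any average of consistent classifiers is again consistent, so the returned coefficient vector classifies the training sample correctly whenever every individual perceptron run converged.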
Figure 4: (a) Histogram of generalisation errors (estimated on a test set) using a kernel Gibbs sampler. (b) Histogram of generalisation errors (estimated on a test set) using a kernel perceptron. (c) QQ plot of distributions (a) and (b). The straight line indicates that the two distributions only differ by an additive and multiplicative constant, i.e. they exhibit the same rate of decay.

A very advantageous feature of this approach as compared to support vector machines is its adjustable time and memory requirements and the anytime availability of a solution due to sampling. If the training sample grows further and we are not able to spend more time learning, we can adjust the number of samples $\mathbf{w}$ used at the cost of slightly worse generalisation error (see also Section 4).

3.3 Extension to Training Error

To allow for training errors we recall that the version space conditions are given by

$$\forall (x_i,y_i) \in z : \quad y_i \langle \mathbf{x}_i,\mathbf{w}\rangle = y_i \sum_{j=1}^{m} \alpha_j k(x_i,x_j) > 0. \qquad (26)$$

Now we introduce the following version space conditions in place of (26):

$$\forall (x_i,y_i) \in z : \quad y_i \sum_{j=1}^{m} \alpha_j k(x_i,x_j) > -\lambda y_i \alpha_i k(x_i,x_i), \qquad (27)$$

where $\lambda$ is an adjustable parameter related to the softness of version space boundaries. Clearly, considering this from the billiard viewpoint, equation (27) can be interpreted as allowing penetration of the walls, an idea already hinted at in Ruján (1997). Since the linear decision function is invariant under any positive rescaling of expansion coefficients $\boldsymbol{\alpha}$, a factor $\alpha_i$ on the right hand side makes $\lambda$ scale invariant as well. Although other ways of incorporating training errors are conceivable, our formulation allows for a simple modification of the algorithms described in the previous two subsections. To see this we note that equation (27) can be rewritten as

$$\forall (x_i,y_i) \in z : \quad y_i \left(\sum_{j=1}^{m} \alpha_j \left(1 + \lambda I_{i=j}\right) k(x_i,x_j)\right) > 0.$$

Hence we can use the above algorithms but with an additive correction to the
diagonal terms of the Gram matrix. This additive correction to the kernel diagonals is similar to the quadratic margin loss used to introduce a soft margin during training of support vector machines (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000). Another insight into the introduction of soft boundaries comes from noting that the distance between two points $\mathbf{x}_i$ and $\mathbf{x}_j$ in feature space $K$ can be written

$$\|\mathbf{x}_i - \mathbf{x}_j\|^2 = \|\mathbf{x}_i\|^2 + \|\mathbf{x}_j\|^2 - 2\langle \mathbf{x}_i,\mathbf{x}_j\rangle,$$
which in the case of points of unit length in feature space becomes $2(1 + \lambda - k(x_i,x_j))$. Thus, if we add $\lambda$ to the diagonal elements of the Gram matrix, the points become equidistant for $\lambda \to \infty$. This would give the resulting version space a more regular shape. As a consequence, the centre of the largest inscribable ball (support vector machine solution) would tend towards the centre of mass of the whole of version space.

Figure 5: Parameter spaces for a two dimensional toy problem obtained by introducing training error via an additive correction to the diagonal term of the kernel matrix (one panel per value of $\lambda$). In order to visualise the resulting parameter space we fixed $m = 3$ and normalised all axes by the product of eigenvalues $\lambda_1 \lambda_2 \lambda_3$. See text for further explanation.

We would like to recall that the effective parameter space of weight vectors considered is given by

$$\mathcal{W}_x := \left\{ \mathbf{w} = \sum_{i=1}^{m} \alpha_i \mathbf{x}_i \;\middle|\; \|\mathbf{w}\|^2 = \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j \langle \mathbf{x}_i,\mathbf{x}_j\rangle = 1 \right\}.$$

In terms of $\boldsymbol{\alpha}$ this can be rewritten as

$$\left\{ \boldsymbol{\alpha} \in \mathbb{R}^m \;\middle|\; \boldsymbol{\alpha}^\top G \boldsymbol{\alpha} = 1 \right\}, \qquad G_{ij} = \langle \mathbf{x}_i,\mathbf{x}_j\rangle = k(x_i,x_j).$$

Let us represent the Gram matrix by its spectral decomposition, i.e. $G = U \Lambda U^\top$ where $U^\top U = I$ and $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_m)$ is the diagonal matrix of eigenvalues $\lambda_i$. Thus we know that the parameter space is the set of all coefficients $\tilde{\boldsymbol{\alpha}} = U^\top \boldsymbol{\alpha}$ which fulfil

$$\left\{ \tilde{\boldsymbol{\alpha}} \in \mathbb{R}^m : \tilde{\boldsymbol{\alpha}}^\top \Lambda \tilde{\boldsymbol{\alpha}} = 1 \right\}.$$

This is the defining equation of an $m$-dimensional axis-parallel ellipsoid. Now adding the term $\lambda$ to the diagonal of $G$ makes $G$ a full rank matrix (see Micchelli, 1986). In Figure 5 we plotted the parameter space for a 2D toy problem using only $m = 3$ training points. Although the parameter space is 3 dimensional for all $\lambda > 0$, we obtain a pancake-like parameter space for small values of $\lambda$. For $\lambda \to \infty$ the set $\tilde{\boldsymbol{\alpha}}$ of admissible coefficients becomes the $m$-dimensional ball, i.e. the training examples become more and more orthogonal with increasing $\lambda$. The way we incorporated training errors corresponds to the choice of a new kernel given by

$$k_\lambda(x,\tilde{x}) := k(x,\tilde{x}) + \lambda\, I_{x=\tilde{x}}.$$
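The soft-boundary correction amounts to a change of the Gram matrix diagonal, which the following Python sketch makes concrete. It is an illustration only: the helper names `soften_gram` and `squared_distance` are hypothetical, and the sketch follows the rewritten form of condition (27), in which each diagonal entry $k(x_i,x_i)$ is multiplied by $(1 + \lambda)$. For points of unit length in feature space this coincides with adding $\lambda$ to the diagonal, i.e. with the kernel $k_\lambda$.

```python
def soften_gram(G, lam):
    """Return a copy of G whose diagonal entries are scaled by (1 + lam).

    For unit-length points (G[i][i] == 1) this equals adding lam to the
    diagonal, which is the kernel k_lambda from the text.
    """
    return [[(1.0 + lam) * v if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(G)]


def squared_distance(G, i, j):
    """||x_i - x_j||^2 = G[i][i] + G[j][j] - 2 * G[i][j] in feature space."""
    return G[i][i] + G[j][j] - 2.0 * G[i][j]
```

With a softened Gram matrix in hand, the same perceptron or billiard code can be run unchanged. For two unit-length points with $k(x_i,x_j) = 0.5$ and $\lambda = 2$, the squared distance evaluates to $2(1 + 2 - 0.5) = 5$, in agreement with the expression in the text.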
Finally, note that this modification of the kernel has no effect on new test points $x \notin \mathbf{x}$ that are not elements of the training sample $\mathbf{x}$. For an explanation of the effect of $\lambda$ in the context of Gaussian processes see Opper and Winther (2000).

Figure 6: Version spaces $V(z)$ for two 3 dimensional toy problems. (Left) One can see that the approximation of the Bayes point (diamond) by the centre of the largest inscribable ball (cross) is reasonable if the version space is regularly shaped. (Right) The situation changes in the case of an elongated and asymmetric version space $V(z)$.

4 Experimental Results

In this section we present experimental results both on University of California, Irvine (UCI) benchmark datasets and on two bigger tasks of handwritten digit recognition, namely the US postal service (USPS) and modified National Institute of Standards (MNIST) digit recognition tasks. We compared our results to the performance of a support vector machine using reported test set performance from Rätsch et al. (2001) (UCI), Schölkopf (1997, p. 57) (USPS) and Cortes (1995) (MNIST). All the experiments were done using Algorithm 2 in Appendix B.

4.1 Artificial Data

For illustration purposes we set up a toy dataset of training and test points in $\mathbb{R}^3$. The data points were uniformly generated in $[-1,1]^3$ and labelled by a randomly generated linear decision rule using the kernel $k(x,\tilde{x}) = \langle x,\tilde{x}\rangle$. In Figure 6 we illustrate the potential benefits of a Bayes point machine over a support vector machine for elongated version spaces. By using the billiard algorithm to estimate the Bayes point (see Subsection 3.1), we were able to track all positions $\mathbf{b}_i$ where the billiard ball hits a version space boundary. This allows us to easily visualise the version spaces $V(z)$. For the example illustrated in Figure 6 (right) the support vector machine and Bayes point solutions with hard margins/boundaries are far apart, resulting in a noticeable reduction in generalisation error of the Bayes point machine (8%) compared to the support
vector machine (5%) solution, whereas for regularly shaped version spaces (Figure 6 (left)) the difference is negligible (6% to 6%).

Footnote 10: publicly available at
Figure 7: Decision functions for a 2D toy problem of a support vector machine (SVM) (left) and Bayes point machine (BPM) (right) using hard margins ($\lambda = 0$) and RBF kernels with $\sigma = $ . Note that the Bayes point machine results in a much flatter function, sacrificing margin ($\gamma_z(\mathbf{w}_{SVM}) = 36$, $\gamma_z(\mathbf{w}_{cm}) = 2$) for smoothness.

In a second illustrative example we compared the smoothness of the resulting decision function when using kernels both with support vector machines and Bayes point machines. In order to model a nonlinear decision surface we used the radial basis function (RBF) kernel

$$k(x,\tilde{x}) = \exp\left(-\frac{\|x - \tilde{x}\|^2}{2\sigma^2}\right). \qquad (28)$$

Figure 7 shows the resulting decision functions in the hard margin/boundary case. Clearly, the Bayes point machine solution appears much smoother than the support vector machine solution although its geometrical margin of 2 is significantly smaller. The above examples should only be considered as aids to enhance the understanding of the Bayes point machine algorithm's properties rather than strict arguments about general superiority.

4.2 UCI Benchmark Datasets

To investigate the performance on real world datasets we compared hard margin support vector machines to Bayes point machines with hard boundaries ($\lambda = 0$) when using the kernel billiard algorithm described in Subsection 3.1. We studied the performance on 5 standard benchmarking datasets from the UCI Repository, and on banana and waveform, two toy datasets (see Rätsch et al., 2001). In each case the data was randomly partitioned into training and test sets in the ratio 60%:40%. The means and standard deviations of the average generalisation errors on the test sets are presented as percentages in the columns headed SVM (hard margin) and BPM ($\lambda = 0$) in Table 1. As can be seen from the results, the Bayes point machine outperforms support vector machines on almost all datasets at a statistically significant level. Note, however, that the result of the t-test is strictly valid only under the assumption that training and test data
were independent, an assumption which may be violated by the procedure of splitting the one data set into different pairs of training and test sets (Dietterich, 1998). Thus, the resulting p-values should serve only as an indication for the significance of the result.

In order to demonstrate the effect of positive $\lambda$ (soft boundaries) we trained a Bayes point machine with soft boundaries and compared it to training a support vector machine with soft margin using the same Gram
matrix (see equation (27)). It can be shown that such a support vector machine corresponds to a soft margin support vector machine where the margin slacks are penalised quadratically (see Cortes, 1995; Shawe-Taylor and Cristianini, 2000; Herbrich, 2001). In Figure 8 we have plotted the generalisation error as a function of $\lambda$ for the toy problem from Figure 6 and the dataset heart using the same setup as in the previous experiment. We observe that the support vector machine with an $\ell_2$ soft margin achieves a minimum of the generalisation error which is close to, or just above, the minimum error which can be achieved using a Bayes point machine with positive $\lambda$. This may not be too surprising taking the change of geometry into account (see Section 3.3). Thus, the soft margin support vector machine also approximates the Bayes point machine with soft boundaries.

Columns: SVM (hard margin), BPM (hard boundary), σ, p-value
Heart 254±4 228±34
Thyroid 53±24 44±2 3
Diabetes 33±24 32±25 5
Waveform 3± 2±9 2
Banana 62±5 5±4 5
Sonar 54±37 59±38
Ionosphere 9±25 5±

Table 1: Experimental results on seven benchmark datasets. We used the RBF kernel given in (28) with values of $\sigma$ found optimal for SVMs. Shown is the estimated generalisation error in percent. The standard deviation was obtained on different runs. The final column gives the p-values of a paired t-test for the hypothesis "BPM is better than SVM", indicating that the improvement is statistically significant.

Finally, we would like to remark that the running time of the kernel billiard was not much different from the running time of our support vector machine implementation. We did not use any chunking or decomposition algorithms (see, e.g. Osuna et al., 1997; Joachims, 1999; Platt, 1999), which in the case of support vector machines would have decreased the running time by orders of magnitude. The most noticeable difference in running time was with the waveform and banana datasets where we are given $m = 4$ observations. This can be explained by the fact that the computational effort of the kernel billiard
method is $O(B m^2)$ where $B$ is the number of bounces. As we set our tolerance criterion TOL for stopping very low ($10^{-4}$), the approximate number $B$ of bounces for these datasets was B . Hence, in contrast to the computational effort of using the support vector machines of $O(m^3)$, the number $B$ of bounces leads to a much higher computational demand when using the kernel billiard.

4.3 Handwritten Digit Recognition

For the two tasks we now consider, our inputs are $n \times n$ grey value images which were transformed into $n^2$-dimensional vectors by concatenation of the rows. The grey values were taken from the set $\{0,\ldots,255\}$. All images were labelled by one of the ten classes 0 to 9. For each of the ten classes $y \in \{0,\ldots,9\}$ we ran the perceptron algorithm L = times, each time labelling all training points of class $y$ by $+1$ and the remaining training points by $-1$. On a Pentium III 500 MHz with 128 MB memory each learning trial took 2 minutes (MNIST) or 2 minutes (USPS), respectively (footnote 11).

Footnote 11: Note, however, that we made use of the fact that 4% of the grey values of each image are 0 since they encode background. Therefore, we encoded each image as an index-value list which allows much faster computation of the inner products $\langle x,\tilde{x}\rangle$ and speeds up the algorithm by a factor of

For the classification of a test image $x$
we calculated the real-valued output of all different classifiers (footnote 12) by

$$f_i(x) = \frac{\langle \mathbf{x},\mathbf{w}_i\rangle}{\|\mathbf{w}_i\|\,\|\mathbf{x}\|} = \frac{\sum_{j=1}^{m} (\alpha_i)_j\, k(x_j,x)}{\sqrt{\sum_{r=1}^{m}\sum_{s=1}^{m} (\alpha_i)_r (\alpha_i)_s\, k(x_r,x_s)}\;\sqrt{k(x,x)}},$$

where we used the kernel $k$ given by (25). Here, $(\alpha_i)_j$ refers to the expansion coefficient corresponding to the $i$-th classifier and the $j$-th data point. Now, for each of the ten classes we calculated the real-valued decision of the Bayes point estimate $\hat{\mathbf{w}}_{cm,y}$ by (footnote 13)

$$f_{bp,y}(x) = \langle \mathbf{x},\hat{\mathbf{w}}_{cm,y}\rangle = \frac{1}{L}\sum_{i=1}^{L} \langle \mathbf{x},\mathbf{w}_{i+yL}\rangle.$$

In a Bayesian spirit, the final decision was carried out by

$$h_{bp}(x) := \operatorname*{argmax}_{y \in \{0,\ldots,9\}} f_{bp,y}(x).$$

Note that $f_{bp,y}(x)$ can be interpreted as an (unnormalised) approximation of the posterior probability that $x$ is of class $y$ when restricted to the function class () (see Platt, 2000). In order to test the dependence of the generalisation error on the magnitude $\max_y f_{bp,y}(x)$ we fixed a certain rejection rate $r \in [0,1]$ and rejected the fraction $r$ of the test points with the smallest value of $\max_y f_{bp,y}(x)$.

Figure 8: Comparison of soft boundary Bayes point machine with soft margin support vector machine. Plotted is the generalisation error (classification error) versus $\lambda$ for a toy problem using linear kernels (left) and the heart dataset using RBF kernels with $\sigma = 3$ (right). The error bars indicate one standard deviation of the estimated mean.

MNIST Handwritten Digits. In the first of our large scale experiments we used the full MNIST dataset with 60000 training examples and 10000 test examples of grey value images of handwritten digits. The plot resulting from learning only consistent classifiers per class and rejection based on the real-valued output of the single Bayes points is depicted in Figure 9 (left). As can be seen from this plot, even without rejection the Bayes point has excellent generalisation performance when compared to support vector machines, which achieve a generalisation error of 4 4%. Furthermore, rejection based on the real-valued

Footnote 12: For notational simplicity we assume that the
first $L$ classifiers are classifiers for the class 0, the next $L$ for class 1 and so on.
Footnote 13: Note that in this subsection $y$ ranges from $\{0,\ldots,9\}$.
Footnote 14: The result of % with the kernel (25) and a polynomial degree of four could not be reproduced and is thus considered invalid (personal communication with P. Haffner). Note also that the best results with support vector machines were obtained when using a soft margin.
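The decision rule and the rejection scheme of this subsection can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names `class_scores` and `decide_with_rejection` are hypothetical, the classifiers are assumed to be stored class by class in blocks of $L$ (as in footnote 12), and the number of rejected points is taken to be the rejection rate times the number of test points, rounded down.

```python
def class_scores(outputs, L):
    """Average the outputs of each block of L classifiers.

    outputs[i + y * L] is assumed to hold the real-valued output of the
    i-th sampled classifier for class y on one fixed test point.
    """
    n_classes = len(outputs) // L
    return [sum(outputs[y * L:(y + 1) * L]) / L for y in range(n_classes)]


def decide_with_rejection(scores, reject_rate):
    """Predict argmax_y scores[t][y]; reject the least confident points.

    scores[t][y] plays the role of f_bp,y(x_t). A rejected test point is
    reported as None. Confidence is the maximum score over the classes.
    """
    n = len(scores)
    predictions = [max(range(len(s)), key=lambda y: s[y]) for s in scores]
    confidence = [max(s) for s in scores]
    n_reject = int(reject_rate * n)
    by_confidence = sorted(range(n), key=lambda t: confidence[t])
    rejected = set(by_confidence[:n_reject])
    return [None if t in rejected else predictions[t] for t in range(n)]
```

Sorting the test points by $\max_y f_{bp,y}(x)$ once makes it cheap to sweep over several rejection rates, since each rate only changes how long a prefix of the sorted list is discarded.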
More informationMedia Adaptation Framework in Biofeedback System for Stroke Patient Rehabilitation
Media Adaptation Fraework in Biofeedback Syste for Stroke Patient Rehabilitation Yinpeng Chen, Weiwei Xu, Hari Sundara, Thanassis Rikakis, ShengMin Liu Arts, Media and Engineering Progra Arizona State
More informationPricing Asian Options using Monte Carlo Methods
U.U.D.M. Project Report 9:7 Pricing Asian Options using Monte Carlo Methods Hongbin Zhang Exaensarbete i ateatik, 3 hp Handledare och exainator: Johan Tysk Juni 9 Departent of Matheatics Uppsala University
More informationSoftware Quality Characteristics Tested For Mobile Application Development
Thesis no: MGSE201502 Software Quality Characteristics Tested For Mobile Application Developent Literature Review and Epirical Survey WALEED ANWAR Faculty of Coputing Blekinge Institute of Technology
More informationComment on On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes
Coent on On Discriinative vs. Generative Classifiers: A Coparison of Logistic Regression and Naive Bayes JingHao Xue (jinghao@stats.gla.ac.uk) and D. Michael Titterington (ike@stats.gla.ac.uk) Departent
More informationAnalyzing Spatiotemporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy
Vol. 9, No. 5 (2016), pp.303312 http://dx.doi.org/10.14257/ijgdc.2016.9.5.26 Analyzing Spatioteporal Characteristics of Education Network Traffic with Flexible Multiscale Entropy Chen Yang, Renjie Zhou
More informationAnalysis of the purchase option of computers
Analysis of the of coputers N. Ahituv and I. Borovits Faculty of Manageent, The Leon Recanati Graduate School of Business Adinistration, TelAviv University, University Capus, RaatAviv, TelAviv, Israel
More informationA Scalable Application Placement Controller for Enterprise Data Centers
W WWW 7 / Track: Perforance and Scalability A Scalable Application Placeent Controller for Enterprise Data Centers Chunqiang Tang, Malgorzata Steinder, Michael Spreitzer, and Giovanni Pacifici IBM T.J.
More informationManaging Complex Network Operation with Predictive Analytics
Managing Coplex Network Operation with Predictive Analytics Zhenyu Huang, Pak Chung Wong, Patrick Mackey, Yousu Chen, Jian Ma, Kevin Schneider, and Frank L. Greitzer Pacific Northwest National Laboratory
More informationRECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION. Henrik Kure
RECURSIVE DYNAMIC PROGRAMMING: HEURISTIC RULES, BOUNDING AND STATE SPACE REDUCTION Henrik Kure Dina, Danish Inforatics Network In the Agricultural Sciences Royal Veterinary and Agricultural University
More informationModeling operational risk data reported above a timevarying threshold
Modeling operational risk data reported above a tievarying threshold Pavel V. Shevchenko CSIRO Matheatical and Inforation Sciences, Sydney, Locked bag 7, North Ryde, NSW, 670, Australia. eail: Pavel.Shevchenko@csiro.au
More informationINTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE SYSTEMS
Artificial Intelligence Methods and Techniques for Business and Engineering Applications 210 INTEGRATED ENVIRONMENT FOR STORING AND HANDLING INFORMATION IN TASKS OF INDUCTIVE MODELLING FOR BUSINESS INTELLIGENCE
More informationFast large scale Gaussian process regression using approximate matrixvector products
Fast large scale Gaussian process regression using approxiate atrixvector products Vikas C. Raykar and Raani Duraiswai Departent of Coputer Science and Institute for advanced coputer studies University
More informationPhysics 211: Lab Oscillations. Simple Harmonic Motion.
Physics 11: Lab Oscillations. Siple Haronic Motion. Reading Assignent: Chapter 15 Introduction: As we learned in class, physical systes will undergo an oscillatory otion, when displaced fro a stable equilibriu.
More informationAUC Optimization vs. Error Rate Minimization
AUC Optiization vs. Error Rate Miniization Corinna Cortes and Mehryar Mohri AT&T Labs Research 180 Park Avenue, Florha Park, NJ 0793, USA {corinna, ohri}@research.att.co Abstract The area under an ROC
More informationMarkovian inventory policy with application to the paper industry
Coputers and Cheical Engineering 26 (2002) 1399 1413 www.elsevier.co/locate/copcheeng Markovian inventory policy with application to the paper industry K. Karen Yin a, *, Hu Liu a,1, Neil E. Johnson b,2
More informationCapacity of MultipleAntenna Systems With Both Receiver and Transmitter Channel State Information
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO., OCTOBER 23 2697 Capacity of MultipleAntenna Systes With Both Receiver and Transitter Channel State Inforation Sudharan K. Jayaweera, Student Meber,
More informationASIC Design Project Management Supported by Multi Agent Simulation
ASIC Design Project Manageent Supported by Multi Agent Siulation Jana Blaschke, Christian Sebeke, Wolfgang Rosenstiel Abstract The coplexity of Application Specific Integrated Circuits (ASICs) is continuously
More informationAudio Engineering Society. Convention Paper. Presented at the 119th Convention 2005 October 7 10 New York, New York USA
Audio Engineering Society Convention Paper Presented at the 119th Convention 2005 October 7 10 New York, New York USA This convention paper has been reproduced fro the authors advance anuscript, without
More informationThis paper studies a rental firm that offers reusable products to price and qualityofservice sensitive
MANUFACTURING & SERVICE OPERATIONS MANAGEMENT Vol., No. 3, Suer 28, pp. 429 447 issn 523464 eissn 5265498 8 3 429 infors doi.287/so.7.8 28 INFORMS INFORMS holds copyright to this article and distributed
More informationFactored Models for Probabilistic Modal Logic
Proceedings of the TwentyThird AAAI Conference on Artificial Intelligence (2008 Factored Models for Probabilistic Modal Logic Afsaneh Shirazi and Eyal Air Coputer Science Departent, University of Illinois
More informationCLOSEDLOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY
CLOSEDLOOP SUPPLY CHAIN NETWORK OPTIMIZATION FOR HONG KONG CARTRIDGE RECYCLING INDUSTRY Y. T. Chen Departent of Industrial and Systes Engineering Hong Kong Polytechnic University, Hong Kong yongtong.chen@connect.polyu.hk
More informationThe AGA Evaluating Model of Customer Loyalty Based on Ecommerce Environment
6 JOURNAL OF SOFTWARE, VOL. 4, NO. 3, MAY 009 The AGA Evaluating Model of Custoer Loyalty Based on Ecoerce Environent Shaoei Yang Econoics and Manageent Departent, North China Electric Power University,
More informationStochastic Online Scheduling on Parallel Machines
Stochastic Online Scheduling on Parallel Machines Nicole Megow 1, Marc Uetz 2, and Tark Vredeveld 3 1 Technische Universit at Berlin, Institut f ur Matheatik, Strasse des 17. Juni 136, 10623 Berlin, Gerany
More informationAdaptive Modulation and Coding for Unmanned Aerial Vehicle (UAV) Radio Channel
Recent Advances in Counications Adaptive odulation and Coding for Unanned Aerial Vehicle (UAV) Radio Channel Airhossein Fereidountabar,Gian Carlo Cardarilli, Rocco Fazzolari,Luca Di Nunzio Abstract In
More informationPerformance Evaluation of Machine Learning Techniques using Software Cost Drivers
Perforance Evaluation of Machine Learning Techniques using Software Cost Drivers Manas Gaur Departent of Coputer Engineering, Delhi Technological University Delhi, India ABSTRACT There is a treendous rise
More informationExploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2
Exploiting Hardware Heterogeneity within the Sae Instance Type of Aazon EC2 Zhonghong Ou, Hao Zhuang, Jukka K. Nurinen, Antti YläJääski, Pan Hui Aalto University, Finland; Deutsch Teleko Laboratories,
More informationModeling Cooperative Gene Regulation Using Fast Orthogonal Search
8 The Open Bioinforatics Journal, 28, 2, 889 Open Access odeling Cooperative Gene Regulation Using Fast Orthogonal Search Ian inz* and ichael J. Korenberg* Departent of Electrical and Coputer Engineering,
More informationThe Virtual Spring Mass System
The Virtual Spring Mass Syste J. S. Freudenberg EECS 6 Ebedded Control Systes Huan Coputer Interaction A force feedbac syste, such as the haptic heel used in the EECS 6 lab, is capable of exhibiting a
More informationFactor Model. Arbitrage Pricing Theory. Systematic Versus NonSystematic Risk. Intuitive Argument
Ross [1],[]) presents the aritrage pricing theory. The idea is that the structure of asset returns leads naturally to a odel of risk preia, for otherwise there would exist an opportunity for aritrage profit.
More informationData Streaming Algorithms for Estimating Entropy of Network Traffic
Data Streaing Algoriths for Estiating Entropy of Network Traffic Ashwin Lall University of Rochester Vyas Sekar Carnegie Mellon University Mitsunori Ogihara University of Rochester Jun (Ji) Xu Georgia
More informationA quantum secret ballot. Abstract
A quantu secret ballot Shahar Dolev and Itaar Pitowsky The Edelstein Center, Levi Building, The Hebrerw University, Givat Ra, Jerusale, Israel Boaz Tair arxiv:quantph/060087v 8 Mar 006 Departent of Philosophy
More informationModified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index
Analog Integrated Circuits and Signal Processing, vol. 9, no., April 999. Abstract Modified Latin Hypercube Sapling Monte Carlo (MLHSMC) Estiation for Average Quality Index Mansour Keraat and Richard Kielbasa
More informationPERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO
Bulletin of the Transilvania University of Braşov Series I: Engineering Sciences Vol. 4 (53) No.  0 PERFORMANCE METRICS FOR THE IT SERVICES PORTFOLIO V. CAZACU I. SZÉKELY F. SANDU 3 T. BĂLAN Abstract:
More informationModeling Parallel Applications Performance on Heterogeneous Systems
Modeling Parallel Applications Perforance on Heterogeneous Systes Jaeela AlJaroodi, Nader Mohaed, Hong Jiang and David Swanson Departent of Coputer Science and Engineering University of Nebraska Lincoln
More informationQuality evaluation of the modelbased forecasts of implied volatility index
Quality evaluation of the odelbased forecasts of iplied volatility index Katarzyna Łęczycka 1 Abstract Influence of volatility on financial arket forecasts is very high. It appears as a specific factor
More information( C) CLASS 10. TEMPERATURE AND ATOMS
CLASS 10. EMPERAURE AND AOMS 10.1. INRODUCION Boyle s understanding of the pressurevolue relationship for gases occurred in the late 1600 s. he relationships between volue and teperature, and between
More informationEnergy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and migration algorithms
Energy Efficient VM Scheduling for Cloud Data Centers: Exact allocation and igration algoriths Chaia Ghribi, Makhlouf Hadji and Djaal Zeghlache Institut MinesTéléco, Téléco SudParis UMR CNRS 5157 9, Rue
More informationEvaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects
Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Lucas Grèze Robert Pellerin Nathalie Perrier Patrice Leclaire February 2011 CIRRELT201111 Bureaux
More informationparallelmcmccombine: An R Package for Bayesian Methods for Big Data and Analytics
parallelcccobine: An R Package for Bayesian ethods for Big Data and Analytics Alexey iroshnikov 1, Erin. Conlon 1* 1 Departent of atheatics and Statistics, University of assachusetts, Aherst, assachusetts,
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationSupport Vector Machine Soft Margin Classifiers: Error Analysis
Journal of Machine Learning Research? (2004)??? Subitted 9/03; Published??/04 Support Vector Machine Soft Margin Classifiers: Error Analysis DiRong Chen Departent of Applied Matheatics Beijing University
More informationReal Time Target Tracking with Binary Sensor Networks and Parallel Computing
Real Tie Target Tracking with Binary Sensor Networks and Parallel Coputing Hong Lin, John Rushing, Sara J. Graves, Steve Tanner, and Evans Criswell Abstract A parallel real tie data fusion and target tracking
More informationOnline Appendix I: A Model of Household Bargaining with Violence. In this appendix I develop a simple model of household bargaining that
Online Appendix I: A Model of Household Bargaining ith Violence In this appendix I develop a siple odel of household bargaining that incorporates violence and shos under hat assuptions an increase in oen
More informationMINIMUM VERTEX DEGREE THRESHOLD FOR LOOSE HAMILTON CYCLES IN 3UNIFORM HYPERGRAPHS
MINIMUM VERTEX DEGREE THRESHOLD FOR LOOSE HAMILTON CYCLES IN 3UNIFORM HYPERGRAPHS JIE HAN AND YI ZHAO Abstract. We show that for sufficiently large n, every 3unifor hypergraph on n vertices with iniu
More informationPreferencebased Search and Multicriteria Optimization
Fro: AAAI02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. Preferencebased Search and Multicriteria Optiization Ulrich Junker ILOG 1681, route des Dolines F06560 Valbonne ujunker@ilog.fr
More informationIEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, ACCEPTED FOR PUBLICATION 1. Secure Wireless Multicast for DelaySensitive Data via Network Coding
IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, ACCEPTED FOR PUBLICATION 1 Secure Wireless Multicast for DelaySensitive Data via Network Coding Tuan T. Tran, Meber, IEEE, Hongxiang Li, Senior Meber, IEEE,
More informationABSTRACT KEYWORDS. Comonotonicity, dependence, correlation, concordance, copula, multivariate. 1. INTRODUCTION
MEASURING COMONOTONICITY IN MDIMENSIONAL VECTORS BY INGE KOCH AND ANN DE SCHEPPER ABSTRACT In this contribution, a new easure of coonotonicity for diensional vectors is introduced, with values between
More informationDynamic Placement for Clustered Web Applications
Dynaic laceent for Clustered Web Applications A. Karve, T. Kibrel, G. acifici, M. Spreitzer, M. Steinder, M. Sviridenko, and A. Tantawi IBM T.J. Watson Research Center {karve,kibrel,giovanni,spreitz,steinder,sviri,tantawi}@us.ib.co
More informationADJUSTING FOR QUALITY CHANGE
ADJUSTING FOR QUALITY CHANGE 7 Introduction 7.1 The easureent of changes in the level of consuer prices is coplicated by the appearance and disappearance of new and old goods and services, as well as changes
More informationPartitioned EliasFano Indexes
Partitioned Eliasano Indexes Giuseppe Ottaviano ISTICNR, Pisa giuseppe.ottaviano@isti.cnr.it Rossano Venturini Dept. of Coputer Science, University of Pisa rossano@di.unipi.it ABSTRACT The Eliasano
More informationImpact of Processing Costs on Service Chain Placement in Network Functions Virtualization
Ipact of Processing Costs on Service Chain Placeent in Network Functions Virtualization Marco Savi, Massio Tornatore, Giacoo Verticale Dipartiento di Elettronica, Inforazione e Bioingegneria, Politecnico
More informationESTIMATING LIQUIDITY PREMIA IN THE SPANISH GOVERNMENT SECURITIES MARKET
ESTIMATING LIQUIDITY PREMIA IN THE SPANISH GOVERNMENT SECURITIES MARKET Francisco Alonso, Roberto Blanco, Ana del Río and Alicia Sanchis Banco de España Banco de España Servicio de Estudios Docuento de
More informationGenerating Certification Authority Authenticated Public Keys in Ad Hoc Networks
SECURITY AND COMMUNICATION NETWORKS Published online in Wiley InterScience (www.interscience.wiley.co). Generating Certification Authority Authenticated Public Keys in Ad Hoc Networks G. Kounga 1, C. J.
More informationModels and Algorithms for Stochastic Online Scheduling 1
Models and Algoriths for Stochastic Online Scheduling 1 Nicole Megow Technische Universität Berlin, Institut für Matheatik, Strasse des 17. Juni 136, 10623 Berlin, Gerany. eail: negow@ath.tuberlin.de
More informationReconnect 04 Solving Integer Programs with Branch and Bound (and Branch and Cut)
Sandia is a ultiprogra laboratory operated by Sandia Corporation, a Lockheed Martin Copany, Reconnect 04 Solving Integer Progras with Branch and Bound (and Branch and Cut) Cynthia Phillips (Sandia National
More informationA magnetic Rotor to convert vacuumenergy into mechanical energy
A agnetic Rotor to convert vacuuenergy into echanical energy Claus W. Turtur, University of Applied Sciences BraunschweigWolfenbüttel Abstract Wolfenbüttel, Mai 21 2008 In previous work it was deonstrated,
More informationConstruction Economics & Finance. Module 3 Lecture1
Depreciation: Construction Econoics & Finance Module 3 Lecture It represents the reduction in arket value of an asset due to age, wear and tear and obsolescence. The physical deterioration of the asset
More informationPartitioning Data on Features or Samples in CommunicationEfficient Distributed Optimization?
Partitioning Data on Features or Saples in CounicationEfficient Distributed Optiization? Chenxin Ma Industrial and Systes Engineering Lehigh University, USA ch54@lehigh.edu Martin Taáč Industrial and
More informationGaussian Processes for Regression: A Quick Introduction
Gaussian Processes for Regression A Quick Introduction M Ebden, August 28 Coents to arkebden@engoacuk MOTIVATION Figure illustrates a typical eaple of a prediction proble given soe noisy observations of
More informationEfficient Key Management for Secure Group Communications with Bursty Behavior
Efficient Key Manageent for Secure Group Counications with Bursty Behavior Xukai Zou, Byrav Raaurthy Departent of Coputer Science and Engineering University of NebraskaLincoln Lincoln, NE68588, USA Eail:
More informationHOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACKMERTONSCHOLES?
HOW CLOSE ARE THE OPTION PRICING FORMULAS OF BACHELIER AND BLACKMERTONSCHOLES? WALTER SCHACHERMAYER AND JOSEF TEICHMANN Abstract. We copare the option pricing forulas of Louis Bachelier and BlackMertonScholes
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationOn site Power Transformer demagnetization. Girish Narayana Haefely, Switzerland S14
Girish Narayana Haefely, Switzerland 1 Technique for onsite of power transforers using a portable device, basics and validation by frequency response analysis Raja Kuppusway 1* Santiago González 1** 1
More informationLecture L9  Linear Impulse and Momentum. Collisions
J. Peraire, S. Widnall 16.07 Dynaics Fall 009 Version.0 Lecture L9  Linear Ipulse and Moentu. Collisions In this lecture, we will consider the equations that result fro integrating Newton s second law,
More informationPosition Auctions and Nonuniform Conversion Rates
Position Auctions and Nonunifor Conversion Rates Liad Blurosen Microsoft Research Mountain View, CA 944 liadbl@icrosoft.co Jason D. Hartline Shuzhen Nong Electrical Engineering and Microsoft AdCenter
More informationMarkov Models and Their Use for Calculations of Important Traffic Parameters of Contact Center
Markov Models and Their Use for Calculations of Iportant Traffic Paraeters of Contact Center ERIK CHROMY, JAN DIEZKA, MATEJ KAVACKY Institute of Telecounications Slovak University of Technology Bratislava
More informationExample: Suppose that we deposit $1000 in a bank account offering 3% interest, compounded monthly. How will our money grow?
Finance 111 Finance We have to work with oney every day. While balancing your checkbook or calculating your onthly expenditures on espresso requires only arithetic, when we start saving, planning for retireent,
More informationRegression Using Support Vector Machines: Basic Foundations
Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering
More informationREQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES
REQUIREMENTS FOR A COMPUTER SCIENCE CURRICULUM EMPHASIZING INFORMATION TECHNOLOGY SUBJECT AREA: CURRICULUM ISSUES Charles Reynolds Christopher Fox reynolds @cs.ju.edu fox@cs.ju.edu Departent of Coputer
More information