AIMS Big Data. Lecture 5: Structured-output learning. January 7, 2015. Andrea Vedaldi

AIMS Big Data, Lecture 5: Structured-output learning. January 7, 2015. Andrea Vedaldi.

Course outline:
1. Discriminative learning
2. Discriminative learning
3. Hashing and kernel maps
4. Learning representations
5. Structured-output learning

For slides and up-to-date information, see the course website.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Beyond classification

Consider now the general problem of learning a function f : X → Y, x ↦ y, where both the input and output spaces are general. Examples:

- Ranking: given a set of objects (o_1, ..., o_k) as input x, return an order as output y.
- Pose estimation: given an image of a human as input x, return the parameters (p_1, ..., p_k) of his/her pose as output y.
- Image segmentation: given an image from Flickr as input x, return a mask highlighting the foreground object as output y.

Support Vector Regression 1/2

A real function R^d → R can be approximated directly by the SVM score: f(x) ≈ ⟨w, Φ(x)⟩. Think of the feature map Φ(x) as a collection of basis functions. For instance, if x ∈ R, one can use the basis of second-order polynomials:

  Φ(x) = [1, x, x²]ᵀ,   ⟨w, Φ(x)⟩ = w_1 + w_2 x + w_3 x².

The goal is to find w (e.g. the polynomial coefficients) such that the score fits the example data, ⟨w, Φ(x_i)⟩ ≈ y_i, by minimising the L1 error L_i(w) = |y_i − ⟨w, Φ(x_i)⟩|.

Support Vector Regression 2/2

SVR is just a variant of regularised regression:

  method           | loss | regul. | objective function
  SVR              | ℓ1   | ℓ2     | (1/n) Σ_i |y_i − ⟨w, Φ(x_i)⟩| + λ‖w‖²
  least squares    | ℓ2   | none   | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)²
  ridge regression | ℓ2   | ℓ2     | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)² + λ‖w‖²
  lasso            | ℓ2   | ℓ1     | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)² + λ‖w‖_1

Limitation: only real functions!

An aside: the ε-insensitive L1 loss

Actually, SVR makes use of a slightly more general loss

  L_i(w) = max{0, |y_i − ⟨w, x_i⟩| − ε},

which is insensitive to errors below a threshold ε. One can set ε = 0, though [Smola and Scholkopf, 2004].

A general approach: learning the graph

Use a binary SVM to classify which pairs (x, y) ∈ X × Y belong to the graph of the function (treat the output as an input!):

  y = f(x)  ⇔  ⟨w, Ψ(x, y)⟩ > 0.

Joint feature map

In order to classify pairs (x, y), these must be encoded as vectors. To this end, we need a joint feature map:

  Ψ : (x, y) ↦ Ψ(x, y) ∈ R^d.

As long as this feature can be designed, the nature of x and y is irrelevant.
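As a concrete illustration of the regression view, the following minimal MATLAB sketch (not part of the lecture code) evaluates the ε-insensitive L1 loss of the linear score built on the second-order polynomial basis above.

function L = eps_insensitive_loss(w, x, y, epsilon)
  % Epsilon-insensitive L1 loss of the score <w, Phi(x)> for a scalar
  % input x, using the quadratic polynomial basis Phi(x) = [1; x; x^2].
  phi = [1 ; x ; x^2] ;
  L = max(0, abs(y - w' * phi) - epsilon) ;
end

Setting epsilon = 0 recovers the plain L1 regression error used above.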

Example: learning the graph of a real function 1/2

[Figure: the learned function f and the scoring function, plotted against x.]

Algorithm:
1. Start from the true pairs (x_i, y_i) (green squares) where the graph should pass.
2. Add many false pairs (x_i, y) (red dots) where the graph should not pass.
3. Learn a scoring function ⟨w, Ψ(x, y)⟩ to fit these points.
4. Define the learned function graph to be the collection of points such that ⟨w, Ψ(x, y)⟩ > 0 (green areas).

The good and the bad

The good: works for any type of inputs and outputs! (Not just real functions.)

The bad:
1. Not one-to-one: for each x, there are multiple outputs y with positive score.
2. Not complete: there are x for which all the outputs have negative score.
3. Very large negative example set.

Example: learning the graph of a real function 2/2

[Figure: the learned function f and the scoring function.]

In this example the joint feature map is a Fourier basis (note the ringing!):

  Ψ(x, y) = [ cos(f_1x x + f_1y y + φ_1) ; cos(f_2x x + f_2y y + φ_2) ; ... ; cos(f_dx x + f_dy y + φ_d) ],

for appropriate frequencies and phases (f_ix, f_iy, φ_i).

Structured output SVMs

Issues 1 and 2 can be fixed by choosing the highest scoring output for each input:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Intuition: the scoring function ⟨w, Ψ(x, y)⟩ is somewhat analogous to a posterior probability density function P(y|x), but it does not have any probabilistic meaning.
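A minimal sketch of such a Fourier joint feature map is shown below. The random choice of frequencies and phases is an assumption made for illustration; the lecture only says they should be chosen appropriately.

function psi = fourier_joint_feature(x, y, fx, fy, phi)
  % Joint feature map Psi(x, y): a bank of d cosines of (x, y).
  % fx, fy, phi are d-dimensional column vectors of frequencies and phases.
  psi = cos(fx * x + fy * y + phi) ;
end

% Illustrative usage with d = 50 random components:
% d = 50 ; fx = 5 * randn(d,1) ; fy = 5 * randn(d,1) ; phi = 2*pi*rand(d,1) ;
% psi = fourier_joint_feature(0.3, -0.7, fx, fy, phi) ;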

Example: real function

[Figure: the learned function f; the scoring function; the column-rescaled scoring function.]

f(x) is the y that maximises the score along column x. f(x) is now uniquely and completely defined. Note: only the relative values of the score along a column really matter (see the rescaled version on the right).

Inference problem

Evaluating a structured SVM requires solving the inference problem

  argmax_y ⟨w, Ψ(x, y)⟩.

The efficiency of using a structured SVM (after learning) depends on how quickly the inference problem can be solved.

Example: binary linear SVM

Standard SVMs can easily be interpreted as structured SVMs:

- Output space: y ∈ Y = {−1, +1}.
- Feature map: Ψ(x, y) = (1/2) y x.
- Inference: ŷ(x; w) = argmax_{y ∈ {−1,+1}} (1/2) y ⟨w, x⟩ = sign ⟨w, x⟩.

Example: object localisation

Let x be an image and y ∈ Y ⊂ R⁴ a rectangular window. The goal is to find the window containing a given object. Let x|_y denote an image window (crop) and Φ(x|_y) its visual features (e.g. a histogram of SIFT features).

- Standard SVM: score one window: ⟨w, Φ(x|_y)⟩ = window score.
- Structured SVM: try all windows and pick the best one:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩ = argmax_y ⟨w, Φ(x|_y)⟩.
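The binary case is easy to spell out in code. The sketch below (illustrative, not from the lecture) enumerates the two outputs and confirms that structured inference reduces to the sign of the score.

function yhat = binary_svm_inference(w, x)
  % Structured-SVM view of a binary linear SVM: Psi(x, y) = 0.5 * y * x,
  % so inference enumerates y in {-1, +1} and picks the larger score.
  labels = [-1, +1] ;
  scores = 0.5 * labels * (w' * x) ;
  [~, k] = max(scores) ;
  yhat = labels(k) ;          % equals sign(w' * x) whenever w' * x ~= 0
end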

Example: pose estimation

Let x be an image and y = (p_1, p_2, p_3, p_4, p_5) the pose of a human, expressed as the 2D location of five parts. The joint feature map stacks an appearance term Φ(x|_{p_i}) for each part and a deformation term ψ(p_i, p_j) for each pair of connected parts:

  Ψ(x, y) = [ Φ(x|_{p_1}) ; ... ; Φ(x|_{p_5}) ; ψ(p_1, p_2) ; ... ; ψ(p_4, p_5) ].

Intuition: the score ⟨w, Ψ(x, y)⟩ reflects how well the five image parts match their appearance models and whether the deformation is reasonable or not.

Example: ranking 1/2

Consider the problem of ranking a list of objects x = (o_1, ..., o_n) (input). The output y is a ranking (total order). This can be represented as a matrix y such that

  y_ij = +1 if o_i has higher rank than o_j,   y_ij = −1 otherwise.

A joint feature map for ranking:

  Ψ(x, y) = Σ_ij y_ij (Φ(o_i) − Φ(o_j)).

Example: ranking 2/2

This structured SVM ranks the objects by decreasing score ⟨Φ(o_i), w⟩:

  ŷ_ij(x; w) = sign( ⟨Φ(o_i), w⟩ − ⟨Φ(o_j), w⟩ ).

In fact the score of this output

  ⟨w, Ψ(x, ŷ(x; w))⟩ = Σ_ij ŷ_ij ⟨Φ(o_i) − Φ(o_j), w⟩
                     = Σ_ij sign(⟨Φ(o_i) − Φ(o_j), w⟩) ⟨Φ(o_i) − Φ(o_j), w⟩
                     = Σ_ij |⟨Φ(o_i) − Φ(o_j), w⟩|

is maximal.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation
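A minimal sketch of the ranking inference step (illustrative; it assumes no ties among the scores): objects are simply sorted by decreasing score, and the pairwise output matrix follows.

function [order, Yhat] = rank_inference(w, Phi)
  % Phi: d x n matrix whose columns are the object descriptors Phi(o_i).
  scores = w' * Phi ;                              % 1 x n scores <Phi(o_i), w>
  [~, order] = sort(scores, 'descend') ;           % ranking by decreasing score
  Yhat = sign(bsxfun(@minus, scores', scores)) ;   % yhat_ij = sign(s_i - s_j)
end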

Summary so far and what remains to be done

Input-output relation: the SVM defines an input-output relation based on maximising the joint score:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Next: how to fit the input-output relation to data.

Learning formulation 1/2

Given n example input-output pairs (x_1, y_1), ..., (x_n, y_n), find w such that the structured SVM approximately fits them,

  ŷ(x_i; w) ≈ y_i,   i = 1, ..., n,

while controlling the complexity of the estimated function.

Objective function (non-convex):

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n Δ(y_i, ŷ(x_i; w)).

Notation reminder: Δ is the loss function, ŷ the output estimated by the SVM, y_i the ground-truth output, and x_i the ground-truth input.

Loss function

The loss function measures the fit quality: Δ(y, ŷ) ≥ 0, and Δ(y, ŷ) = 0 if, and only if, y = ŷ. Examples:

- For a binary SVM the loss is Δ(y, ŷ) = 1 if y ≠ ŷ and 0 otherwise.
- In object localisation the loss could be one minus the ratio of the areas of the intersection and union of the rectangles y and ŷ:

  Δ(y, ŷ) = 1 − |y ∩ ŷ| / |y ∪ ŷ|.

- In ranking, suitable losses include the ROC AUC, the precision-recall AUC, precision at rank k, ...

Example: a ranking loss

The ROC curve plots the true positive rate against the true negative rate. [Figure: ROC curve and the area under it.] Given the true ranking y and the estimated ŷ, we can define

  Δ(y, ŷ) = 1 − ROCAUC(y, ŷ).

One can show that this is, up to normalisation, simply the number of incorrectly ranked pairs, i.e.

  Δ(y, ŷ) ∝ Σ_{i,j=1}^n [ y_ij ≠ ŷ_ij ].
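For the localisation loss, a minimal MATLAB sketch is given below. It assumes windows are encoded as [xmin ymin xmax ymax] vectors, a convention the lecture does not specify.

function delta = box_loss(y, yhat)
  % Localisation loss: one minus the intersection-over-union of two windows.
  iw = max(0, min(y(3), yhat(3)) - max(y(1), yhat(1))) ;   % intersection width
  ih = max(0, min(y(4), yhat(4)) - max(y(2), yhat(2))) ;   % intersection height
  inter = iw * ih ;
  union = (y(3) - y(1)) * (y(4) - y(2)) + (yhat(3) - yhat(1)) * (yhat(4) - yhat(2)) - inter ;
  delta = 1 - inter / union ;
end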

Learning formulation 2/2

The goal of learning is to find the minimiser w* of

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n Δ(y_i, ŷ(x_i; w)),   where ŷ(x_i; w) = argmax_y ⟨w, Ψ(x_i, y)⟩.

The dependency of the loss on w is very complex: Δ is non-convex and is composed with an argmax!

The surrogate loss

Given a convex surrogate loss L_i(w) ≥ Δ(y_i, ŷ(x_i; w)), we consider the objective

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w).

The key to the success of structured SVMs is the existence of good surrogates. There are standard constructions that work well in a variety of cases (but not always!). The aim is to make minimising L_i(w) have the same effect as minimising Δ(y_i, ŷ(x_i; w)):

- Bounding property: Δ(y_i, ŷ(x_i; w)) ≤ L_i(w).
- Tightness: if we can find w* such that L_i(w*) = 0, then Δ(y_i, ŷ(x_i; w*)) = 0. But can we? Not always! Consider setting L_i(w) equal to a very large constant. We need a tight bound, e.g.

  Δ(y_i, ŷ(x_i; w*)) = 0  ⇔  L_i(w*) = 0.

Margin rescaling surrogate

Margin rescaling is the first standard surrogate construction:

  L_i(w) = sup_y [ Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ].

This surrogate bounds the loss:

  Δ(y_i, ŷ(x_i; w)) ≤ Δ(y_i, ŷ(x_i; w)) + ⟨Ψ(x_i, ŷ(x_i; w)), w⟩ − ⟨Ψ(x_i, y_i), w⟩
                    ≤ sup_y [ Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ] = L_i(w),

because ŷ(x_i; w) maximises the score by definition, so the added score difference ⟨Ψ(x_i, ŷ(x_i; w)), w⟩ − ⟨Ψ(x_i, y_i), w⟩ is non-negative.

Margin condition

Is margin rescaling a tight approximation? The following margin condition holds:

  L_i(w*) = 0  ⇔  ∀y ∈ Y:  ⟨Ψ(x_i, y_i), w*⟩ ≥ ⟨Ψ(x_i, y), w*⟩ + Δ(y_i, y)

(score of the ground-truth output ≥ score of any other output + margin).

Tightness: the surrogate is not tight in the sense above:

  Δ(y_i, ŷ(x_i; w*)) = 0  does not imply  L_i(w*) = 0.

In order to minimise the surrogate, the more stringent margin condition has to be satisfied! But this is usually good enough, and in fact beneficial (it implies robustness).
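When the output space can be enumerated, the margin-rescaling surrogate can be evaluated directly. A minimal sketch (all callback names are illustrative, not part of the lecture code):

function L = margin_rescaling_loss(w, x, y, Ys, delta, psi)
  % Evaluate L_i(w) = max over candidate outputs y' of
  % Delta(y, y') + <Psi(x, y') - Psi(x, y), w>, given an enumerable
  % candidate set Ys (cell array) and callbacks delta(y, y'), psi(x, y).
  L = -inf ;
  for k = 1:numel(Ys)
    L = max(L, delta(y, Ys{k}) + w' * (psi(x, Ys{k}) - psi(x, y))) ;
  end
end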

Slack rescaling surrogate

Slack rescaling is the second standard surrogate construction:

  L_i(w) = sup_y Δ(y_i, y) [ 1 + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ].

It may give better results than margin rescaling. However, it is often significantly harder to treat in calculations. The margin condition is

  L_i(w*) = 0  ⇔  ∀y ≠ y_i:  ⟨Ψ(x_i, y_i), w*⟩ ≥ ⟨Ψ(x_i, y), w*⟩ + 1

(score of the ground-truth output ≥ score of any other output + a margin of 1).

Augmented inference

Evaluating the objective E(w) requires computing the supremum in the augmented loss

  sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

Maximising this quantity is the augmented inference problem, so called due to its similarity with the inference problem max_y ⟨Ψ(x_i, y), w⟩. Augmented inference can be significantly harder than inference, especially for slack rescaling.

Example: binary linear SVM

Recall that for a binary linear SVM: Y = {−1, +1}, Ψ(x, y) = (1/2) y x, Δ(y_i, ŷ) = [y_i ≠ ŷ]. Then, in the margin rescaling construction, solving the augmented inference problem yields

  L_i(w) = sup_{y ∈ {−1,+1}} [y_i ≠ y] + (1/2) y ⟨x_i, w⟩ − (1/2) y_i ⟨x_i, w⟩
         = max_{y ∈ {−y_i, y_i}} [y_i ≠ y] + (1/2)(y − y_i) ⟨x_i, w⟩
         = max{0, 1 − y_i ⟨x_i, w⟩},

i.e. the same hinge loss as a standard SVM. In this case, slack rescaling yields the same result.

The good and the bad of convex surrogates

Good: convex surrogates separate the ground-truth outputs y_i from the other outputs y by a margin modulated by the loss.

Bad: despite their construction, they can be poor approximations of the original loss.

- They are unimodal, and therefore cannot model situations in which different outputs are equally acceptable.
- If the ground truth y_i is not separable, they may be incapable of identifying which is the best output that can actually be achieved instead; there is no graceful fallback.
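The reduction to the hinge loss is easy to check numerically; a minimal sketch with illustrative data:

% For the binary linear SVM, the margin-rescaling surrogate obtained by
% enumerating y in {-1, +1} coincides with the standard hinge loss.
w = randn(5, 1) ; x = randn(5, 1) ; y = 1 ;        % illustrative example
cand = [-1, +1] ;
vals = arrayfun(@(yb) (yb ~= y) + 0.5 * yb * (w' * x) - 0.5 * y * (w' * x), cand) ;
L_margin = max(vals) ;
L_hinge  = max(0, 1 - y * (w' * x)) ;
assert(abs(L_margin - L_hinge) < 1e-12) ;          % the two losses agree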

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Summary so far and what remains to be done

Input-output relation: the SVM defines an input-output relation based on maximising the joint score:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Convex surrogate objective: the joint score can be designed to fit the data (x_1, y_1), ..., (x_n, y_n) by optimising

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w).

Next: how to solve this optimisation problem.

A (naive) direct approach 1/2

Learning a structured SVM requires solving an optimisation problem of the type

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w),   L_i(w) = sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

More in general, this can be rewritten as

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w),   L_i(w) = sup_y b_iy − ⟨a_iy, w⟩.

A (naive) direct approach 2/2

This problem can be rewritten as a constrained quadratic program in the parameters w and the slack variables ξ:

  E(w, ξ) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n ξ_i,   ξ_i ≥ b_iy − ⟨a_iy, w⟩,  i = 1, ..., n,  y ∈ Y.

Can we use a standard quadratic solver (e.g. quadprog in MATLAB)?

The size of this problem: there is one set of constraints for each data point (x_i, y_i), and each set contains one linear constraint for each output y. This is way too large (even infinite!) to be fed directly to a quadratic solver.
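Just to make the naive formulation concrete, here is a minimal sketch of how the constrained QP could be handed to quadprog when the output space is small enough to enumerate. The containers A_c{i} and b_c{i} (holding the rows a_iy and offsets b_iy for example i) and the variables d, n, lambda are assumptions introduced for this illustration.

% Variables z = [w ; xi]; objective (lambda/2)*||w||^2 + mean(xi);
% constraints  -<a_iy, w> - xi_i <= -b_iy  for every example i and output y.
H = blkdiag(lambda * eye(d), zeros(n)) ;
f = [zeros(d, 1) ; ones(n, 1) / n] ;
A = [] ; b = [] ;
for i = 1:n
  ei = zeros(1, n) ; ei(i) = 1 ;
  A = [A ; -A_c{i}, -repmat(ei, size(A_c{i}, 1), 1)] ;
  b = [b ; -b_c{i}] ;
end
z = quadprog(H, f, A, b) ;
w = z(1:d) ; xi = z(d+1:end) ;

This only works when Y is small; for general structured problems the constraint set is far too large, which is what motivates the cutting-plane approach discussed next.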

A second look

Let us look again at the original problem in a slightly different form:

  E(w) = (λ/2) ‖w‖² + L(w),   L(w) = (1/n) Σ_{i=1}^n sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

L(w) is a convex, non-smooth function with bounded Lipschitz constant (i.e., it does not vary too fast). Optimisation of such functions is extensively studied in operational research. We are going to discuss the Bundle Method for Regularized Risk Minimization (BMRM), a special case of bundle methods for regularised loss functions, which in turn is a stabilised variant of cutting planes.

Subgradient and subdifferential

Assumption: L(w) is convex, not necessarily smooth, with bounded Lipschitz constant G. A subgradient of L at w is any vector g, with ‖g‖ ≤ G, such that

  ∀w′:  L(w′) ≥ L(w) + ⟨g, w′ − w⟩.

The subdifferential ∂L(w) is the set of all subgradients; it contains only the gradient ∇L(w) if the function is differentiable.

Cutting planes

Given a point w_1, we approximate the convex L(w) from below by a tangent plane:

  L(w) ≥ b_1 − ⟨a_1, w⟩,   a_1 ∈ −∂L(w_1),   b_1 = L(w_1) + ⟨a_1, w_1⟩.

(a_1, b_1) is the cutting plane at w_1. Given the cutting planes at w_1, ..., w_t, we define the lower approximation

  L^(t)(w) = max_{i=1,...,t} b_i − ⟨a_i, w⟩.

Cutting plane algorithm

Goal: minimise a convex, not necessarily smooth function L(w). Method: incrementally construct a lower approximation L^(t)(w); at each iteration, minimise the latter to obtain w_t and add a cutting plane at that point.

Start with w_0 = 0 and t = 0. Then repeat:
1. t ← t + 1.
2. Get a cutting plane (a_t, b_t) by computing a subgradient of L(w) at w_{t−1}.
3. Add the plane to the current approximation L^(t)(w).
4. Set w_t = argmin_w L^(t)(w).
5. If L(w_t) − L^(t)(w_t) < ε, stop as converged.

[Kiwiel, 1990, Lemaréchal et al., 1995, Joachims et al., 2009]
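A minimal one-dimensional sketch of this loop is shown below. The test function, interval and tolerance are arbitrary choices for illustration (not the ones used in the lecture figures), and the inner minimisation of L^(t) is done by brute-force grid search for simplicity.

% Basic cutting-plane loop on a 1D convex function L(w) = w*log(w).
L  = @(w) w .* log(w) ;
dL = @(w) log(w) + 1 ;              % gradient (the unique subgradient here)
grid = linspace(0.1, 1, 1000) ;     % minimise the lower model on a grid
w = 1 ; A = [] ; B = [] ; epsilon = 1e-4 ;
for t = 1:100
  a = -dL(w) ; b = L(w) + a * w ;   % cutting plane: L(w') >= b - a*w'
  A(end+1) = a ; B(end+1) = b ;
  Lt = max(bsxfun(@minus, B', A' * grid), [], 1) ;   % lower model L^(t) on the grid
  [Lt_min, k] = min(Lt) ; w = grid(k) ;              % w_t = argmin L^(t)
  if L(w) - Lt_min < epsilon, break ; end            % gap certificate
end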

Guarantees at convergence

[Figure: L^(t)(w) bounding L(w) from below, with the iterate w_t and the optimum w*.]

The algorithm stops when L(w_t) − L^(t)(w_t) < ε. The true optimum L(w*) is then sandwiched:

  L^(t)(w_t) ≤ L^(t)(w*) ≤ L(w*) ≤ L(w_t)

(w_t minimises L^(t), w* minimises L, and L^(t) ≤ L). Hence, when the algorithm converges, one has the guarantee

  L(w_t) ≤ L(w*) + ε.

Cutting plane example

[Figure: cutting-plane iterations on a simple one-dimensional convex function over an interval.]

BMRM: cutting planes with a regulariser

The standard cutting-plane algorithm takes forever to converge (it is not the one used for SVMs...), as it can take wild steps. Bundle methods try to regularise the steps, but are generally difficult to tune. BMRM notes that one already has a regulariser in the SVM objective function:

  E(w) = (λ/2) ‖w‖² + L(w).

BMRM algorithm. Start with w_0 = 0 and t = 0. Then repeat:
1. t ← t + 1.
2. Get a cutting plane (a_t, b_t) by computing a subgradient of L(w) at w_{t−1}.
3. Add the plane to the current approximation L^(t)(w).
4. Set E_t(w) = (λ/2) ‖w‖² + L^(t)(w).
5. Set w_t = argmin_w E_t(w).
6. If E(w_t) − E_t(w_t) < ε, stop as converged.

[Teo et al., 2010], but also [Kiwiel, 1990, Lemaréchal et al., 1995, Joachims et al., 2009]

BMRM example

[Figure: the same one-dimensional example optimised with BMRM, i.e. with the quadratic regulariser added to the objective.]

Application of BMRM to structured SVMs

In this case:

  L(w) = (1/n) Σ_{i=1}^n sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

A subgradient of L(w) is just the average of the subgradients of the terms. The subgradient g_i at w of a term is computed by determining the maximally violated output

  ȳ_i = argmax_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

Remark 1: this is the augmented inference problem.
Remark 2: once ȳ_i is obtained, the subgradient is given by g_i = Ψ(x_i, ȳ_i) − Ψ(x_i, y_i).

Thus BMRM can be applied provided that the augmented inference problem can be solved (even when Y is infinite!).

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Structured SVM: fitting a real function

Consider the problem of learning a real function f : R → [−1, 1] by fitting points (x_1, y_1), ..., (x_n, y_n).

Loss:  Δ(y, ŷ) = |ŷ − y|.

Joint feature map:

  Ψ(x, y) = [ y ; y x ; y x² ; y x³ ; −y²/2 ].

To see why this works, we will look at the resulting inference problem.

MATLAB implementation 1/3

First, program a callback for the loss.

function delta = losscb(param, y, ybar)
  delta = abs(ybar - y) ;
end

Then a callback for the feature map.

function psi = featurecb(param, x, y)
  psi = [y ;
         y * x ;
         y * x^2 ;
         y * x^3 ;
         -0.5 * y^2] ;
  psi = sparse(psi) ;
end
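Putting the two remarks together, a cutting plane for the whole objective can be assembled by averaging per-example subgradients. A minimal sketch follows; the callbacks augmented_inference, psi and delta are illustrative stand-ins, not part of the lecture code.

function [a, b] = get_cutting_plane(w, X, Y, psi, delta, augmented_inference)
  % One BMRM cutting plane (a, b) such that L(w') >= b - <a, w'> for all w'.
  n = numel(X) ; g = zeros(numel(w), 1) ; Lw = 0 ;
  for i = 1:n
    ybar = augmented_inference(w, X{i}, Y{i}) ;      % most violated output
    gi = psi(X{i}, ybar) - psi(X{i}, Y{i}) ;         % per-term subgradient
    g = g + gi / n ;
    Lw = Lw + (delta(Y{i}, ybar) + w' * gi) / n ;    % per-term loss value
  end
  a = -g ;                   % plane in the b - <a, w> convention used above
  b = Lw + a' * w ;
end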

Inference

The inference problem is

  ŷ(x; w) = argmax_{y ∈ [−1,1]} ⟨w, Ψ(x, y)⟩
          = argmax_{y ∈ [−1,1]} y (w_1 + w_2 x + w_3 x² + w_4 x³) − (y²/2) w_5.

Differentiate with respect to y and set the derivative to zero to obtain

  ŷ(x; w) = (w_1/w_5) + (w_2/w_5) x + (w_3/w_5) x² + (w_4/w_5) x³.

Note: there are some other special cases, due to the fact that y ∈ [−1, +1] and that w_5 may be negative.

Augmented inference

Solving the augmented inference problem is needed to compute the value and a subgradient of the margin-rescaling loss

  L_i(w) = max_{ŷ ∈ [−1,1]} Δ(y_i, ŷ) + ⟨w, Ψ(x, ŷ)⟩ − ⟨w, Ψ(x, y_i)⟩
         = max_{ŷ ∈ [−1,1]} |ŷ − y_i| + ŷ (w_1 + w_2 x + w_3 x² + w_4 x³) − (ŷ²/2) w_5 + const.

The maximiser is one of at most four possibilities:

  { −1, +1, (z − 1)/w_5, (z + 1)/w_5 } ∩ [−1, 1],   z = w_1 + w_2 x + w_3 x² + w_4 x³.

Try the four cases and pick the one with the larger augmented loss.

MATLAB implementation 2/3

Finally, program the augmented inference.

function yhat = constraintcb(param, model, x, y)
  w = model.w ;
  z = w(1) + w(2) * x + w(3) * x.^2 + w(4) * x.^3 ;
  yhat = [] ;
  if w(5) > 0
    yhat = [z - 1, z + 1] / w(5) ;
    yhat = max(min(yhat, 1), -1) ;
  end
  yhat = [yhat, -1, 1] ;
  aloss = @(y_) abs(y_ - y) + z * y_ - 0.5 * y_.^2 * w(5) ;
  [drop, worse] = max(aloss(yhat)) ;
  yhat = yhat(worse) ;
end

MATLAB implementation 3/3

Once the callbacks are coded, we use an off-the-shelf solver (svm_struct_learn).

% training examples
parm.patterns = {-2, -1, 0, 1, 2} ;
parm.labels = {0.5, -0.5, 0.5, -0.5, 0.5} ;

% callbacks & other parameters
parm.lossFn = @losscb ;
parm.constraintFn = @constraintcb ;
parm.featureFn = @featurecb ;
parm.dimension = 5 ;

% call the solver and print the model
model = svm_struct_learn(' -c 1.0 -o 2 ', parm) ;
model.w
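For completeness, plain (non-augmented) inference for this model can also be written as a small function. The sketch below handles the boundary cases mentioned above by simply comparing the candidate maximisers; it is illustrative and not part of the lecture code.

function yhat = infer(w, x)
  % Maximise y*z - 0.5*y^2*w(5) over y in [-1, 1],
  % where z = w(1) + w(2)*x + w(3)*x^2 + w(4)*x^3.
  z = w(1) + w(2) * x + w(3) * x^2 + w(4) * x^3 ;
  cand = [-1, 1] ;
  if w(5) > 0
    cand = [cand, max(min(z / w(5), 1), -1)] ;   % interior stationary point
  end
  vals = cand * z - 0.5 * cand.^2 * w(5) ;
  [~, k] = max(vals) ;
  yhat = cand(k) ;
end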

Learning the scoring function

[Figure: the scoring function (and its column-rescaled version) after successive cutting-plane iterations.]

After each cutting-plane iteration the scoring function F(x, y) = ⟨Ψ(x, y), w⟩ is updated. Remember: the output function is obtained by maximising the score along the columns. The relative scaling of the columns is irrelevant, and rescaling them reveals the structure better.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

How fast is BMRM?

BMRM is provably convergent to a desired approximation ε. The convergence rates with respect to the accuracy ε are not bad:

  loss L(w)            | non-smooth | smooth
  number of iterations | O(1/ε)     | O(log(1/ε))
  accounting for λ     | O(1/(λε))  | O((1/λ) log(1/ε))

Note: the convergence rate also depends on the amount of regularisation λ. Difficult learning problems (e.g. object detection) typically have large n, small λ, and small ε, so fast convergence is not so obvious.

BMRM for structured SVMs: problem size

BMRM decouples the data from the approximation of L(w):

- The number of data points n affects the cost of evaluating L(w) and its subgradient.
- However, the cost of optimising L^(t)(w) depends only on the iteration number t!

In practice t is small, and L^(t)(w) may be minimised very efficiently in the dual.

BMRM subproblem in the primal

The problem

  min_w (λ/2) ‖w‖² + L^(t)(w),   L^(t)(w) = max_{i=1,...,t} b_i − ⟨a_i, w⟩

reduces to the constrained quadratic program

  min_{w,ξ} (λ/2) ‖w‖² + ξ,   ξ ≥ b_i − ⟨a_i, w⟩,  i = 1, ..., t.

Note that there is a single (scalar) slack variable. This is known as the one-slack formulation.

BMRM subproblem in the dual

Let b = [b_1, ..., b_t], A = [a_1, ..., a_t] and K = AᵀA/λ. The corresponding dual problem is

  max_α ⟨α, b⟩ − (1/2) αᵀKα,   α ≥ 0,  Σ_i α_i = 1,

where at the optimum w = (1/λ) A α.

Intuition: why it is efficient. The original infinite constraints are approximated by just t constraints in L^(t)(w). This is possible because:
1. The approximation needs to be good only around the optimum.
2. The effective dimensionality and redundancy of the data are exploited.

Solving the corresponding quadratic problem is easy because t is small.

Remark: BMRM is a primal solver. Switching to the dual for the subproblems is convenient but completely optional.

Implementation

An attractive aspect is the ease of implementation; a sketch of a possible quadraticsolver based on the dual is given below, after the caching table.

w = 0 ;                       % initial solution (a zero vector in general)
A = [] ;
B = [] ;
minimum = -inf ;
while getobjective(w) - minimum > epsilon
  [a, b] = getcuttingplane(w) ;
  A = [A, a] ;
  B = [B, b] ;
  [w, minimum] = quadraticsolver(lambda, A, B) ;
end

A simple quadratic solver may do, as the problem is small (e.g. MATLAB quadprog). getcuttingplane computes an average of subgradients, in turn obtained by solving the augmented inference problems.

Tricks of the trade: caching 1/3

         | w_1          | w_2          | w_3          | ...
  L_1(w) | (a_11, b_11) | (a_12, b_12) | (a_13, b_13) | ...
  L_2(w) | (a_21, b_21) | (a_22, b_22) | (a_23, b_23) | ...
  ...    |              |              |              |
  L_n(w) | (a_n1, b_n1) | (a_n2, b_n2) | (a_n3, b_n3) | ...
  L(w)   | (a_1, b_1)   | (a_2, b_2)   | (a_3, b_3)   | ...

For each novel w_t a new constraint per example is generated by running augmented inference. The overall loss is an average of per-example losses,

  L(w) = (1/n) Σ_{i=1}^n L_i(w),

and so is each cutting plane:

  a_t = (1/n) Σ_{i=1}^n a_it,   b_t = (1/n) Σ_{i=1}^n b_it.
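The quadraticsolver step referenced above can, for instance, solve the dual subproblem with quadprog. The following minimal sketch assumes A = [a_1, ..., a_t] and B = [b_1, ..., b_t] hold the accumulated (averaged) cutting planes; it is an illustration, not the solver used in the lecture.

function [w, minimum] = quadraticsolver(lambda, A, B)
  % Solve  max_alpha <alpha, B> - 0.5*alpha'*K*alpha  over the simplex,
  % with K = A'*A/lambda, then recover the primal iterate w = A*alpha/lambda.
  t = size(A, 2) ;
  K = (A' * A) / lambda ;
  alpha = quadprog(K, -B(:), [], [], ones(1, t), 1, zeros(t, 1), []) ;
  w = (A * alpha) / lambda ;
  minimum = 0.5 * lambda * (w' * w) + max(B(:) - A' * w) ;  % value of E_t at its minimiser
end

Together with getcuttingplane (an average of per-example subgradients, as sketched earlier), this completes a bare-bones BMRM loop.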

Tricks of the trade: caching 2/3

         | w_1          | w_2          | w_3          | ... | cache pick
  L_1(w) | (a_11, b_11) | (a_12, b_12) | (a_13, b_13) | ... | t_1*
  L_2(w) | (a_21, b_21) | (a_22, b_22) | (a_23, b_23) | ... | t_2*
  ...    |              |              |              |     |
  L_n(w) | (a_n1, b_n1) | (a_n2, b_n2) | (a_n3, b_n3) | ... | t_n*
  L(w)   | (a_1, b_1)   | (a_2, b_2)   | (a_3, b_3)   | ... | (a_{t+δt}, b_{t+δt})

Caching recombines constraints generated so far to obtain a novel cutting plane without running augmented inference (expensive) [Joachims, 2006, Felzenszwalb et al., 2008].

1. For each example i = 1, ..., n pick the most violated constraint in the cache:

  t_i* = argmax_{t'=1,...,t} b_it' − ⟨a_it', w⟩.

2. Now form a novel cutting plane by recombining the existing constraints:

  a_{t+δt} = (1/n) Σ_{i=1}^n a_{i t_i*},   b_{t+δt} = (1/n) Σ_{i=1}^n b_{i t_i*}.

Tricks of the trade: caching 3/3

Caching is very important for problems like object detection, in which inference is very expensive (seconds or minutes per image). Consider for example the [Felzenszwalb et al., 2008] object detector: with several thousand training images and five seconds per image for inference, a single round of augmented inference already takes hours! Thus the solver should be iterated until the examples in the cache are correctly separated: it is pointless to fetch more constraints before the solution has stabilised, given the huge cost.

Preventive caching: during a round of inference it is also possible to return and store in the cache a small set of highly violated constraints. They may become the most violated ones at a later iteration.

Tricks of the trade: incremental training

Another speedup is to train the model gradually, by adding progressively more training samples. The intuition is that a lot of the samples are only needed to refine the model.

Bibliography

P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. CVPR, 2008.

T. Joachims. Training linear SVMs in linear time. In Proc. KDD, 2006.

T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.

K. C. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46, 1990.

C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69, 1995.

A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3), 2004.

C. H. Teo, S. V. N. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 2010.
