AIMS Big Data. Lecture 5: Structured-output learning. January 7, 2015. Andrea Vedaldi

AIMS Big Data, Lecture 5: Structured-output learning. January 7, 2015. Andrea Vedaldi.

Course outline:
1. Discriminative learning
2. Discriminative learning
3. Hashing and kernel maps
4. Learning representations
5. Structured-output learning

For slides and up-to-date information, see the course website.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Beyond classification

Consider now the general problem of learning a function f : X → Y, x ↦ y, where both the input and output spaces are general. Examples:

- Ranking: given a set of objects (o_1, ..., o_k) as input x, return an order as output y.
- Pose estimation: given an image of a human as input x, return the parameters (p_1, ..., p_k) of his/her pose as output y.
- Image segmentation: given an image from Flickr as input x, return a mask highlighting the foreground object as output y.

Support Vector Regression 1/2

A real function R^d → R can be approximated directly by the SVM score: f(x) ≈ ⟨w, Φ(x)⟩. Think of the feature map Φ(x) as a collection of basis functions. For instance, if x ∈ R, one can use the basis of second-order polynomials:

  Φ(x) = [1, x, x²]ᵀ,   ⟨w, Φ(x)⟩ = w_1 + w_2 x + w_3 x².

The goal is to find w (e.g. the polynomial coefficients) such that the score fits the example data, ⟨w, Φ(x_i)⟩ ≈ y_i, by minimising the L1 error L_i(w) = |y_i − ⟨w, Φ(x_i)⟩|.

Support Vector Regression 2/2

SVR is just a variant of regularised regression:

  method           | loss | regul. | objective function
  SVR              | ℓ1   | ℓ2     | (1/n) Σ_i |y_i − ⟨w, Φ(x_i)⟩| + λ‖w‖²
  least squares    | ℓ2   | none   | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)²
  ridge regression | ℓ2   | ℓ2     | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)² + λ‖w‖²
  lasso            | ℓ2   | ℓ1     | (1/n) Σ_i (y_i − ⟨w, Φ(x_i)⟩)² + λ‖w‖_1

Limitation: only real functions!

An aside: the ε-insensitive L1 loss

Actually, SVR makes use of a slightly more general loss

  L_i(w) = max{0, |y_i − ⟨w, x_i⟩| − ε},

which is insensitive to errors below a threshold ε. One can set ε = 0, though [Smola and Scholkopf, 2004].

A general approach: learning the graph

Use a binary SVM to classify which pairs (x, y) ∈ X × Y belong to the graph of the function (treat the output as an input!):

  y = f(x)  ⇔  ⟨w, Ψ(x, y)⟩ > 0.

Joint feature map

In order to classify pairs (x, y), these must be encoded as vectors. To this end, we need a joint feature map:

  Ψ : (x, y) ↦ Ψ(x, y) ∈ R^d.

As long as this feature can be designed, the nature of x and y is irrelevant.
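As a concrete illustration of the regression view, the following minimal MATLAB sketch (not part of the lecture code) evaluates the ε-insensitive L1 loss of the linear score built on the second-order polynomial basis above.

function L = eps_insensitive_loss(w, x, y, epsilon)
  % Epsilon-insensitive L1 loss of the score <w, Phi(x)> for a scalar
  % input x, using the quadratic polynomial basis Phi(x) = [1; x; x^2].
  phi = [1 ; x ; x^2] ;
  L = max(0, abs(y - w' * phi) - epsilon) ;
end

Setting epsilon = 0 recovers the plain L1 regression error used above.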

Example: learning the graph of a real function 1/2

[Figure: the learned function f and the scoring function, plotted against x.]

Algorithm:
1. Start from the true pairs (x_i, y_i) (green squares) where the graph should pass.
2. Add many false pairs (x_i, y) (red dots) where the graph should not pass.
3. Learn a scoring function ⟨w, Ψ(x, y)⟩ to fit these points.
4. Define the learned function graph to be the collection of points such that ⟨w, Ψ(x, y)⟩ > 0 (green areas).

The good and the bad

The good: works for any type of inputs and outputs! (Not just real functions.)

The bad:
1. Not one-to-one: for each x, there are multiple outputs y with positive score.
2. Not complete: there are x for which all the outputs have negative score.
3. Very large negative example set.

Example: learning the graph of a real function 2/2

[Figure: the learned function f and the scoring function.]

In this example the joint feature map is a Fourier basis (note the ringing!):

  Ψ(x, y) = [ cos(f_1x x + f_1y y + φ_1) ; cos(f_2x x + f_2y y + φ_2) ; ... ; cos(f_dx x + f_dy y + φ_d) ],

for appropriate frequencies and phases (f_ix, f_iy, φ_i).

Structured output SVMs

Issues 1 and 2 can be fixed by choosing the highest scoring output for each input:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Intuition: the scoring function ⟨w, Ψ(x, y)⟩ is somewhat analogous to a posterior probability density function P(y|x), but it does not have any probabilistic meaning.
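A minimal sketch of such a Fourier joint feature map is shown below. The random choice of frequencies and phases is an assumption made for illustration; the lecture only says they should be chosen appropriately.

function psi = fourier_joint_feature(x, y, fx, fy, phi)
  % Joint feature map Psi(x, y): a bank of d cosines of (x, y).
  % fx, fy, phi are d-dimensional column vectors of frequencies and phases.
  psi = cos(fx * x + fy * y + phi) ;
end

% Illustrative usage with d = 50 random components:
% d = 50 ; fx = 5 * randn(d,1) ; fy = 5 * randn(d,1) ; phi = 2*pi*rand(d,1) ;
% psi = fourier_joint_feature(0.3, -0.7, fx, fy, phi) ;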

Example: real function

[Figure: the learned function f; the scoring function; the column-rescaled scoring function.]

f(x) is the y that maximises the score along column x. f(x) is now uniquely and completely defined. Note: only the relative values of the score along a column really matter (see the rescaled version on the right).

Inference problem

Evaluating a structured SVM requires solving the inference problem

  argmax_y ⟨w, Ψ(x, y)⟩.

The efficiency of using a structured SVM (after learning) depends on how quickly the inference problem can be solved.

Example: binary linear SVM

Standard SVMs can easily be interpreted as structured SVMs:

- Output space: y ∈ Y = {−1, +1}.
- Feature map: Ψ(x, y) = (1/2) y x.
- Inference: ŷ(x; w) = argmax_{y ∈ {−1,+1}} (1/2) y ⟨w, x⟩ = sign ⟨w, x⟩.

Example: object localisation

Let x be an image and y ∈ Y ⊂ R⁴ a rectangular window. The goal is to find the window containing a given object. Let x|_y denote an image window (crop) and Φ(x|_y) its visual features (e.g. a histogram of SIFT features).

- Standard SVM: score one window: ⟨w, Φ(x|_y)⟩ = window score.
- Structured SVM: try all windows and pick the best one:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩ = argmax_y ⟨w, Φ(x|_y)⟩.
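The binary case is easy to spell out in code. The sketch below (illustrative, not from the lecture) enumerates the two outputs and confirms that structured inference reduces to the sign of the score.

function yhat = binary_svm_inference(w, x)
  % Structured-SVM view of a binary linear SVM: Psi(x, y) = 0.5 * y * x,
  % so inference enumerates y in {-1, +1} and picks the larger score.
  labels = [-1, +1] ;
  scores = 0.5 * labels * (w' * x) ;
  [~, k] = max(scores) ;
  yhat = labels(k) ;          % equals sign(w' * x) whenever w' * x ~= 0
end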

Example: pose estimation

Let x be an image and y = (p_1, p_2, p_3, p_4, p_5) the pose of a human, expressed as the 2D location of five parts. The joint feature map stacks an appearance term Φ(x|_{p_i}) for each part and a deformation term ψ(p_i, p_j) for each pair of connected parts:

  Ψ(x, y) = [ Φ(x|_{p_1}) ; ... ; Φ(x|_{p_5}) ; ψ(p_1, p_2) ; ... ; ψ(p_4, p_5) ].

Intuition: the score ⟨w, Ψ(x, y)⟩ reflects how well the five image parts match their appearance models and whether the deformation is reasonable or not.

Example: ranking 1/2

Consider the problem of ranking a list of objects x = (o_1, ..., o_n) (input). The output y is a ranking (total order). This can be represented as a matrix y such that

  y_ij = +1 if o_i has higher rank than o_j,   y_ij = −1 otherwise.

A joint feature map for ranking:

  Ψ(x, y) = Σ_ij y_ij (Φ(o_i) − Φ(o_j)).

Example: ranking 2/2

This structured SVM ranks the objects by decreasing score ⟨Φ(o_i), w⟩:

  ŷ_ij(x; w) = sign( ⟨Φ(o_i), w⟩ − ⟨Φ(o_j), w⟩ ).

In fact the score of this output

  ⟨w, Ψ(x, ŷ(x; w))⟩ = Σ_ij ŷ_ij ⟨Φ(o_i) − Φ(o_j), w⟩
                     = Σ_ij sign(⟨Φ(o_i) − Φ(o_j), w⟩) ⟨Φ(o_i) − Φ(o_j), w⟩
                     = Σ_ij |⟨Φ(o_i) − Φ(o_j), w⟩|

is maximal.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation
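A minimal sketch of the ranking inference step (illustrative; it assumes no ties among the scores): objects are simply sorted by decreasing score, and the pairwise output matrix follows.

function [order, Yhat] = rank_inference(w, Phi)
  % Phi: d x n matrix whose columns are the object descriptors Phi(o_i).
  scores = w' * Phi ;                              % 1 x n scores <Phi(o_i), w>
  [~, order] = sort(scores, 'descend') ;           % ranking by decreasing score
  Yhat = sign(bsxfun(@minus, scores', scores)) ;   % yhat_ij = sign(s_i - s_j)
end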

Summary so far and what remains to be done

Input-output relation: the SVM defines an input-output relation based on maximising the joint score:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Next: how to fit the input-output relation to data.

Learning formulation 1/2

Given n example input-output pairs (x_1, y_1), ..., (x_n, y_n), find w such that the structured SVM approximately fits them,

  ŷ(x_i; w) ≈ y_i,   i = 1, ..., n,

while controlling the complexity of the estimated function.

Objective function (non-convex):

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n Δ(y_i, ŷ(x_i; w)).

Notation reminder: Δ is the loss function, ŷ the output estimated by the SVM, y_i the ground-truth output, and x_i the ground-truth input.

Loss function

The loss function measures the fit quality: Δ(y, ŷ) ≥ 0, and Δ(y, ŷ) = 0 if, and only if, y = ŷ. Examples:

- For a binary SVM the loss is Δ(y, ŷ) = 1 if y ≠ ŷ and 0 otherwise.
- In object localisation the loss could be one minus the ratio of the areas of the intersection and union of the rectangles y and ŷ:

  Δ(y, ŷ) = 1 − |y ∩ ŷ| / |y ∪ ŷ|.

- In ranking, suitable losses include the ROC AUC, the precision-recall AUC, precision at rank k, ...

Example: a ranking loss

The ROC curve plots the true positive rate against the true negative rate. [Figure: ROC curve and the area under it.] Given the true ranking y and the estimated ŷ, we can define

  Δ(y, ŷ) = 1 − ROCAUC(y, ŷ).

One can show that this is, up to normalisation, simply the number of incorrectly ranked pairs, i.e.

  Δ(y, ŷ) ∝ Σ_{i,j=1}^n [ y_ij ≠ ŷ_ij ].
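For the localisation loss, a minimal MATLAB sketch is given below. It assumes windows are encoded as [xmin ymin xmax ymax] vectors, a convention the lecture does not specify.

function delta = box_loss(y, yhat)
  % Localisation loss: one minus the intersection-over-union of two windows.
  iw = max(0, min(y(3), yhat(3)) - max(y(1), yhat(1))) ;   % intersection width
  ih = max(0, min(y(4), yhat(4)) - max(y(2), yhat(2))) ;   % intersection height
  inter = iw * ih ;
  union = (y(3) - y(1)) * (y(4) - y(2)) + (yhat(3) - yhat(1)) * (yhat(4) - yhat(2)) - inter ;
  delta = 1 - inter / union ;
end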

Learning formulation 2/2

The goal of learning is to find the minimiser w* of

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n Δ(y_i, ŷ(x_i; w)),   where ŷ(x_i; w) = argmax_y ⟨w, Ψ(x_i, y)⟩.

The dependency of the loss on w is very complex: Δ is non-convex and is composed with an argmax!

The surrogate loss

Given a convex surrogate loss L_i(w) ≥ Δ(y_i, ŷ(x_i; w)), we consider the objective

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w).

The key to the success of structured SVMs is the existence of good surrogates. There are standard constructions that work well in a variety of cases (but not always!). The aim is to make minimising L_i(w) have the same effect as minimising Δ(y_i, ŷ(x_i; w)):

- Bounding property: Δ(y_i, ŷ(x_i; w)) ≤ L_i(w).
- Tightness: if we can find w* such that L_i(w*) = 0, then Δ(y_i, ŷ(x_i; w*)) = 0. But can we? Not always! Consider setting L_i(w) equal to a very large constant. We need a tight bound, e.g.

  Δ(y_i, ŷ(x_i; w*)) = 0  ⇔  L_i(w*) = 0.

Margin rescaling surrogate

Margin rescaling is the first standard surrogate construction:

  L_i(w) = sup_y [ Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ].

This surrogate bounds the loss:

  Δ(y_i, ŷ(x_i; w)) ≤ Δ(y_i, ŷ(x_i; w)) + ⟨Ψ(x_i, ŷ(x_i; w)), w⟩ − ⟨Ψ(x_i, y_i), w⟩
                    ≤ sup_y [ Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ] = L_i(w),

because ŷ(x_i; w) maximises the score by definition, so the added score difference ⟨Ψ(x_i, ŷ(x_i; w)), w⟩ − ⟨Ψ(x_i, y_i), w⟩ is non-negative.

Margin condition

Is margin rescaling a tight approximation? The following margin condition holds:

  L_i(w*) = 0  ⇔  ∀y ∈ Y:  ⟨Ψ(x_i, y_i), w*⟩ ≥ ⟨Ψ(x_i, y), w*⟩ + Δ(y_i, y)

(score of the ground-truth output ≥ score of any other output + margin).

Tightness: the surrogate is not tight in the sense above:

  Δ(y_i, ŷ(x_i; w*)) = 0  does not imply  L_i(w*) = 0.

In order to minimise the surrogate, the more stringent margin condition has to be satisfied! But this is usually good enough, and in fact beneficial (it implies robustness).
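When the output space can be enumerated, the margin-rescaling surrogate can be evaluated directly. A minimal sketch (all callback names are illustrative, not part of the lecture code):

function L = margin_rescaling_loss(w, x, y, Ys, delta, psi)
  % Evaluate L_i(w) = max over candidate outputs y' of
  % Delta(y, y') + <Psi(x, y') - Psi(x, y), w>, given an enumerable
  % candidate set Ys (cell array) and callbacks delta(y, y'), psi(x, y).
  L = -inf ;
  for k = 1:numel(Ys)
    L = max(L, delta(y, Ys{k}) + w' * (psi(x, Ys{k}) - psi(x, y))) ;
  end
end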

Slack rescaling surrogate

Slack rescaling is the second standard surrogate construction:

  L_i(w) = sup_y Δ(y_i, y) [ 1 + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩ ].

It may give better results than margin rescaling. However, it is often significantly harder to treat in calculations. The margin condition is

  L_i(w*) = 0  ⇔  ∀y ≠ y_i:  ⟨Ψ(x_i, y_i), w*⟩ ≥ ⟨Ψ(x_i, y), w*⟩ + 1

(score of the ground-truth output ≥ score of any other output + a margin of 1).

Augmented inference

Evaluating the objective E(w) requires computing the supremum in the augmented loss

  sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

Maximising this quantity is the augmented inference problem, so called due to its similarity with the inference problem max_y ⟨Ψ(x_i, y), w⟩. Augmented inference can be significantly harder than inference, especially for slack rescaling.

Example: binary linear SVM

Recall that for a binary linear SVM: Y = {−1, +1}, Ψ(x, y) = (1/2) y x, Δ(y_i, ŷ) = [y_i ≠ ŷ]. Then, in the margin rescaling construction, solving the augmented inference problem yields

  L_i(w) = sup_{y ∈ {−1,+1}} [y_i ≠ y] + (1/2) y ⟨x_i, w⟩ − (1/2) y_i ⟨x_i, w⟩
         = max_{y ∈ {−y_i, y_i}} [y_i ≠ y] + (1/2)(y − y_i) ⟨x_i, w⟩
         = max{0, 1 − y_i ⟨x_i, w⟩},

i.e. the same hinge loss as a standard SVM. In this case, slack rescaling yields the same result.

The good and the bad of convex surrogates

Good: convex surrogates separate the ground-truth outputs y_i from the other outputs y by a margin modulated by the loss.

Bad: despite their construction, they can be poor approximations of the original loss.

- They are unimodal, and therefore cannot model situations in which different outputs are equally acceptable.
- If the ground truth y_i is not separable, they may be incapable of identifying which is the best output that can actually be achieved instead; there is no graceful fallback.
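The reduction to the hinge loss is easy to check numerically; a minimal sketch with illustrative data:

% For the binary linear SVM, the margin-rescaling surrogate obtained by
% enumerating y in {-1, +1} coincides with the standard hinge loss.
w = randn(5, 1) ; x = randn(5, 1) ; y = 1 ;        % illustrative example
cand = [-1, +1] ;
vals = arrayfun(@(yb) (yb ~= y) + 0.5 * yb * (w' * x) - 0.5 * y * (w' * x), cand) ;
L_margin = max(vals) ;
L_hinge  = max(0, 1 - y * (w' * x)) ;
assert(abs(L_margin - L_hinge) < 1e-12) ;          % the two losses agree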

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Summary so far and what remains to be done

Input-output relation: the SVM defines an input-output relation based on maximising the joint score:

  ŷ(x; w) = argmax_y ⟨w, Ψ(x, y)⟩.

Convex surrogate objective: the joint score can be designed to fit the data (x_1, y_1), ..., (x_n, y_n) by optimising

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w).

Next: how to solve this optimisation problem.

A (naive) direct approach 1/2

Learning a structured SVM requires solving an optimisation problem of the type

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w),   L_i(w) = sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

More in general, this can be rewritten as

  E(w) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n L_i(w),   L_i(w) = sup_y b_iy − ⟨a_iy, w⟩.

A (naive) direct approach 2/2

This problem can be rewritten as a constrained quadratic program in the parameters w and the slack variables ξ:

  E(w, ξ) = (λ/2) ‖w‖² + (1/n) Σ_{i=1}^n ξ_i,   ξ_i ≥ b_iy − ⟨a_iy, w⟩,  i = 1, ..., n,  y ∈ Y.

Can we use a standard quadratic solver (e.g. quadprog in MATLAB)?

The size of this problem: there is one set of constraints for each data point (x_i, y_i), and each set contains one linear constraint for each output y. This is way too large (even infinite!) to be fed directly to a quadratic solver.
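Just to make the naive formulation concrete, here is a minimal sketch of how the constrained QP could be handed to quadprog when the output space is small enough to enumerate. The containers A_c{i} and b_c{i} (holding the rows a_iy and offsets b_iy for example i) and the variables d, n, lambda are assumptions introduced for this illustration.

% Variables z = [w ; xi]; objective (lambda/2)*||w||^2 + mean(xi);
% constraints  -<a_iy, w> - xi_i <= -b_iy  for every example i and output y.
H = blkdiag(lambda * eye(d), zeros(n)) ;
f = [zeros(d, 1) ; ones(n, 1) / n] ;
A = [] ; b = [] ;
for i = 1:n
  ei = zeros(1, n) ; ei(i) = 1 ;
  A = [A ; -A_c{i}, -repmat(ei, size(A_c{i}, 1), 1)] ;
  b = [b ; -b_c{i}] ;
end
z = quadprog(H, f, A, b) ;
w = z(1:d) ; xi = z(d+1:end) ;

This only works when Y is small; for general structured problems the constraint set is far too large, which is what motivates the cutting-plane approach discussed next.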

A second look

Let us look again at the original problem in a slightly different form:

  E(w) = (λ/2) ‖w‖² + L(w),   L(w) = (1/n) Σ_{i=1}^n sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

L(w) is a convex, non-smooth function with bounded Lipschitz constant (i.e., it does not vary too fast). Optimisation of such functions is extensively studied in operational research. We are going to discuss the Bundle Method for Regularized Risk Minimization (BMRM), a special case of bundle methods for regularised loss functions, which in turn is a stabilised variant of cutting planes.

Subgradient and subdifferential

Assumption: L(w) is convex, not necessarily smooth, with bounded Lipschitz constant G. A subgradient of L at w is any vector g, with ‖g‖ ≤ G, such that

  ∀w′:  L(w′) ≥ L(w) + ⟨g, w′ − w⟩.

The subdifferential ∂L(w) is the set of all subgradients; it contains only the gradient ∇L(w) if the function is differentiable.

Cutting planes

Given a point w_1, we approximate the convex L(w) from below by a tangent plane:

  L(w) ≥ b_1 − ⟨a_1, w⟩,   a_1 ∈ −∂L(w_1),   b_1 = L(w_1) + ⟨a_1, w_1⟩.

(a_1, b_1) is the cutting plane at w_1. Given the cutting planes at w_1, ..., w_t, we define the lower approximation

  L^(t)(w) = max_{i=1,...,t} b_i − ⟨a_i, w⟩.

Cutting plane algorithm

Goal: minimise a convex, not necessarily smooth function L(w). Method: incrementally construct a lower approximation L^(t)(w); at each iteration, minimise the latter to obtain w_t and add a cutting plane at that point.

Start with w_0 = 0 and t = 0. Then repeat:
1. t ← t + 1.
2. Get a cutting plane (a_t, b_t) by computing a subgradient of L(w) at w_{t−1}.
3. Add the plane to the current approximation L^(t)(w).
4. Set w_t = argmin_w L^(t)(w).
5. If L(w_t) − L^(t)(w_t) < ε, stop as converged.

[Kiwiel, 1990, Lemaréchal et al., 1995, Joachims et al., 2009]
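A minimal one-dimensional sketch of this loop is shown below. The test function, interval and tolerance are arbitrary choices for illustration (not the ones used in the lecture figures), and the inner minimisation of L^(t) is done by brute-force grid search for simplicity.

% Basic cutting-plane loop on a 1D convex function L(w) = w*log(w).
L  = @(w) w .* log(w) ;
dL = @(w) log(w) + 1 ;              % gradient (the unique subgradient here)
grid = linspace(0.1, 1, 1000) ;     % minimise the lower model on a grid
w = 1 ; A = [] ; B = [] ; epsilon = 1e-4 ;
for t = 1:100
  a = -dL(w) ; b = L(w) + a * w ;   % cutting plane: L(w') >= b - a*w'
  A(end+1) = a ; B(end+1) = b ;
  Lt = max(bsxfun(@minus, B', A' * grid), [], 1) ;   % lower model L^(t) on the grid
  [Lt_min, k] = min(Lt) ; w = grid(k) ;              % w_t = argmin L^(t)
  if L(w) - Lt_min < epsilon, break ; end            % gap certificate
end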

Guarantees at convergence

[Figure: L^(t)(w) bounding L(w) from below, with the iterate w_t and the optimum w*.]

The algorithm stops when L(w_t) − L^(t)(w_t) < ε. The true optimum L(w*) is then sandwiched:

  L^(t)(w_t) ≤ L^(t)(w*) ≤ L(w*) ≤ L(w_t)

(w_t minimises L^(t), w* minimises L, and L^(t) ≤ L). Hence, when the algorithm converges, one has the guarantee

  L(w_t) ≤ L(w*) + ε.

Cutting plane example

[Figure: cutting-plane iterations on a simple one-dimensional convex function over an interval.]

BMRM: cutting planes with a regulariser

The standard cutting-plane algorithm takes forever to converge (it is not the one used for SVMs...), as it can take wild steps. Bundle methods try to regularise the steps, but are generally difficult to tune. BMRM notes that one already has a regulariser in the SVM objective function:

  E(w) = (λ/2) ‖w‖² + L(w).

BMRM algorithm. Start with w_0 = 0 and t = 0. Then repeat:
1. t ← t + 1.
2. Get a cutting plane (a_t, b_t) by computing a subgradient of L(w) at w_{t−1}.
3. Add the plane to the current approximation L^(t)(w).
4. Set E_t(w) = (λ/2) ‖w‖² + L^(t)(w).
5. Set w_t = argmin_w E_t(w).
6. If E(w_t) − E_t(w_t) < ε, stop as converged.

[Teo et al., 2010], but also [Kiwiel, 1990, Lemaréchal et al., 1995, Joachims et al., 2009]

BMRM example

[Figure: the same one-dimensional example optimised with BMRM, i.e. with the quadratic regulariser added to the objective.]

Application of BMRM to structured SVMs

In this case:

  L(w) = (1/n) Σ_{i=1}^n sup_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

A subgradient of L(w) is just the average of the subgradients of the terms. The subgradient g_i at w of a term is computed by determining the maximally violated output

  ȳ_i = argmax_y Δ(y_i, y) + ⟨Ψ(x_i, y), w⟩ − ⟨Ψ(x_i, y_i), w⟩.

Remark 1: this is the augmented inference problem.
Remark 2: once ȳ_i is obtained, the subgradient is given by g_i = Ψ(x_i, ȳ_i) − Ψ(x_i, y_i).

Thus BMRM can be applied provided that the augmented inference problem can be solved (even when Y is infinite!).

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

Structured SVM: fitting a real function

Consider the problem of learning a real function f : R → [−1, 1] by fitting points (x_1, y_1), ..., (x_n, y_n).

Loss:  Δ(y, ŷ) = |ŷ − y|.

Joint feature map:

  Ψ(x, y) = [ y ; y x ; y x² ; y x³ ; −y²/2 ].

To see why this works, we will look at the resulting inference problem.

MATLAB implementation 1/3

First, program a callback for the loss.

function delta = losscb(param, y, ybar)
  delta = abs(ybar - y) ;
end

Then a callback for the feature map.

function psi = featurecb(param, x, y)
  psi = [y ;
         y * x ;
         y * x^2 ;
         y * x^3 ;
         -0.5 * y^2] ;
  psi = sparse(psi) ;
end
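Putting the two remarks together, a cutting plane for the whole objective can be assembled by averaging per-example subgradients. A minimal sketch follows; the callbacks augmented_inference, psi and delta are illustrative stand-ins, not part of the lecture code.

function [a, b] = get_cutting_plane(w, X, Y, psi, delta, augmented_inference)
  % One BMRM cutting plane (a, b) such that L(w') >= b - <a, w'> for all w'.
  n = numel(X) ; g = zeros(numel(w), 1) ; Lw = 0 ;
  for i = 1:n
    ybar = augmented_inference(w, X{i}, Y{i}) ;      % most violated output
    gi = psi(X{i}, ybar) - psi(X{i}, Y{i}) ;         % per-term subgradient
    g = g + gi / n ;
    Lw = Lw + (delta(Y{i}, ybar) + w' * gi) / n ;    % per-term loss value
  end
  a = -g ;                   % plane in the b - <a, w> convention used above
  b = Lw + a' * w ;
end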

Inference

The inference problem is

  ŷ(x; w) = argmax_{y ∈ [−1,1]} ⟨w, Ψ(x, y)⟩
          = argmax_{y ∈ [−1,1]} y (w_1 + w_2 x + w_3 x² + w_4 x³) − (y²/2) w_5.

Differentiate with respect to y and set the derivative to zero to obtain

  ŷ(x; w) = (w_1/w_5) + (w_2/w_5) x + (w_3/w_5) x² + (w_4/w_5) x³.

Note: there are some other special cases, due to the fact that y ∈ [−1, +1] and that w_5 may be negative.

Augmented inference

Solving the augmented inference problem is needed to compute the value and a subgradient of the margin-rescaling loss

  L_i(w) = max_{ŷ ∈ [−1,1]} Δ(y_i, ŷ) + ⟨w, Ψ(x, ŷ)⟩ − ⟨w, Ψ(x, y_i)⟩
         = max_{ŷ ∈ [−1,1]} |ŷ − y_i| + ŷ (w_1 + w_2 x + w_3 x² + w_4 x³) − (ŷ²/2) w_5 + const.

The maximiser is one of at most four possibilities:

  { −1, +1, (z − 1)/w_5, (z + 1)/w_5 } ∩ [−1, 1],   z = w_1 + w_2 x + w_3 x² + w_4 x³.

Try the four cases and pick the one with the larger augmented loss.

MATLAB implementation 2/3

Finally, program the augmented inference.

function yhat = constraintcb(param, model, x, y)
  w = model.w ;
  z = w(1) + w(2) * x + w(3) * x.^2 + w(4) * x.^3 ;
  yhat = [] ;
  if w(5) > 0
    yhat = [z - 1, z + 1] / w(5) ;
    yhat = max(min(yhat, 1), -1) ;
  end
  yhat = [yhat, -1, 1] ;
  aloss = @(y_) abs(y_ - y) + z * y_ - 0.5 * y_.^2 * w(5) ;
  [drop, worse] = max(aloss(yhat)) ;
  yhat = yhat(worse) ;
end

MATLAB implementation 3/3

Once the callbacks are coded, we use an off-the-shelf solver (svm_struct_learn).

% training examples
parm.patterns = {-2, -1, 0, 1, 2} ;
parm.labels = {0.5, -0.5, 0.5, -0.5, 0.5} ;

% callbacks & other parameters
parm.lossFn = @losscb ;
parm.constraintFn = @constraintcb ;
parm.featureFn = @featurecb ;
parm.dimension = 5 ;

% call the solver and print the model
model = svm_struct_learn(' -c 1.0 -o 2 ', parm) ;
model.w
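For completeness, plain (non-augmented) inference for this model can also be written as a small function. The sketch below handles the boundary cases mentioned above by simply comparing the candidate maximisers; it is illustrative and not part of the lecture code.

function yhat = infer(w, x)
  % Maximise y*z - 0.5*y^2*w(5) over y in [-1, 1],
  % where z = w(1) + w(2)*x + w(3)*x^2 + w(4)*x^3.
  z = w(1) + w(2) * x + w(3) * x^2 + w(4) * x^3 ;
  cand = [-1, 1] ;
  if w(5) > 0
    cand = [cand, max(min(z / w(5), 1), -1)] ;   % interior stationary point
  end
  vals = cand * z - 0.5 * cand.^2 * w(5) ;
  [~, k] = max(vals) ;
  yhat = cand(k) ;
end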

Learning the scoring function

[Figure: the scoring function (and its column-rescaled version) after successive cutting-plane iterations.]

After each cutting-plane iteration the scoring function F(x, y) = ⟨Ψ(x, y), w⟩ is updated. Remember: the output function is obtained by maximising the score along the columns. The relative scaling of the columns is irrelevant, and rescaling them reveals the structure better.

Outline

- Beyond classification: structured output SVMs
- Learning formulations
- Optimisation
- A complete example
- Further insights on optimisation

How fast is BMRM?

BMRM is provably convergent to a desired approximation ε. The convergence rates with respect to the accuracy ε are not bad:

  loss L(w)            | non-smooth | smooth
  number of iterations | O(1/ε)     | O(log(1/ε))
  accounting for λ     | O(1/(λε))  | O((1/λ) log(1/ε))

Note: the convergence rate also depends on the amount of regularisation λ. Difficult learning problems (e.g. object detection) typically have large n, small λ, and small ε, so fast convergence is not so obvious.

BMRM for structured SVMs: problem size

BMRM decouples the data from the approximation of L(w):

- The number of data points n affects the cost of evaluating L(w) and its subgradient.
- However, the cost of optimising L^(t)(w) depends only on the iteration number t!

In practice t is small, and L^(t)(w) may be minimised very efficiently in the dual.

BMRM subproblem in the primal

The problem

  min_w (λ/2) ‖w‖² + L^(t)(w),   L^(t)(w) = max_{i=1,...,t} b_i − ⟨a_i, w⟩

reduces to the constrained quadratic program

  min_{w,ξ} (λ/2) ‖w‖² + ξ,   ξ ≥ b_i − ⟨a_i, w⟩,  i = 1, ..., t.

Note that there is a single (scalar) slack variable. This is known as the one-slack formulation.

BMRM subproblem in the dual

Let b = [b_1, ..., b_t], A = [a_1, ..., a_t] and K = AᵀA/λ. The corresponding dual problem is

  max_α ⟨α, b⟩ − (1/2) αᵀKα,   α ≥ 0,  Σ_i α_i = 1,

where at the optimum w = (1/λ) A α.

Intuition: why it is efficient. The original infinite constraints are approximated by just t constraints in L^(t)(w). This is possible because:
1. The approximation needs to be good only around the optimum.
2. The effective dimensionality and redundancy of the data are exploited.

Solving the corresponding quadratic problem is easy because t is small.

Remark: BMRM is a primal solver. Switching to the dual for the subproblems is convenient but completely optional.

Implementation

An attractive aspect is the ease of implementation; a sketch of a possible quadraticsolver based on the dual is given below, after the caching table.

w = 0 ;                       % initial solution (a zero vector in general)
A = [] ;
B = [] ;
minimum = -inf ;
while getobjective(w) - minimum > epsilon
  [a, b] = getcuttingplane(w) ;
  A = [A, a] ;
  B = [B, b] ;
  [w, minimum] = quadraticsolver(lambda, A, B) ;
end

A simple quadratic solver may do, as the problem is small (e.g. MATLAB quadprog). getcuttingplane computes an average of subgradients, in turn obtained by solving the augmented inference problems.

Tricks of the trade: caching 1/3

         | w_1          | w_2          | w_3          | ...
  L_1(w) | (a_11, b_11) | (a_12, b_12) | (a_13, b_13) | ...
  L_2(w) | (a_21, b_21) | (a_22, b_22) | (a_23, b_23) | ...
  ...    |              |              |              |
  L_n(w) | (a_n1, b_n1) | (a_n2, b_n2) | (a_n3, b_n3) | ...
  L(w)   | (a_1, b_1)   | (a_2, b_2)   | (a_3, b_3)   | ...

For each novel w_t a new constraint per example is generated by running augmented inference. The overall loss is an average of per-example losses,

  L(w) = (1/n) Σ_{i=1}^n L_i(w),

and so is each cutting plane:

  a_t = (1/n) Σ_{i=1}^n a_it,   b_t = (1/n) Σ_{i=1}^n b_it.
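The quadraticsolver step referenced above can, for instance, solve the dual subproblem with quadprog. The following minimal sketch assumes A = [a_1, ..., a_t] and B = [b_1, ..., b_t] hold the accumulated (averaged) cutting planes; it is an illustration, not the solver used in the lecture.

function [w, minimum] = quadraticsolver(lambda, A, B)
  % Solve  max_alpha <alpha, B> - 0.5*alpha'*K*alpha  over the simplex,
  % with K = A'*A/lambda, then recover the primal iterate w = A*alpha/lambda.
  t = size(A, 2) ;
  K = (A' * A) / lambda ;
  alpha = quadprog(K, -B(:), [], [], ones(1, t), 1, zeros(t, 1), []) ;
  w = (A * alpha) / lambda ;
  minimum = 0.5 * lambda * (w' * w) + max(B(:) - A' * w) ;  % value of E_t at its minimiser
end

Together with getcuttingplane (an average of per-example subgradients, as sketched earlier), this completes a bare-bones BMRM loop.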

Tricks of the trade: caching 2/3

         | w_1          | w_2          | w_3          | ... | cache pick
  L_1(w) | (a_11, b_11) | (a_12, b_12) | (a_13, b_13) | ... | t_1*
  L_2(w) | (a_21, b_21) | (a_22, b_22) | (a_23, b_23) | ... | t_2*
  ...    |              |              |              |     |
  L_n(w) | (a_n1, b_n1) | (a_n2, b_n2) | (a_n3, b_n3) | ... | t_n*
  L(w)   | (a_1, b_1)   | (a_2, b_2)   | (a_3, b_3)   | ... | (a_{t+δt}, b_{t+δt})

Caching recombines constraints generated so far to obtain a novel cutting plane without running augmented inference (expensive) [Joachims, 2006, Felzenszwalb et al., 2008].

1. For each example i = 1, ..., n pick the most violated constraint in the cache:

  t_i* = argmax_{t'=1,...,t} b_it' − ⟨a_it', w⟩.

2. Now form a novel cutting plane by recombining the existing constraints:

  a_{t+δt} = (1/n) Σ_{i=1}^n a_{i t_i*},   b_{t+δt} = (1/n) Σ_{i=1}^n b_{i t_i*}.

Tricks of the trade: caching 3/3

Caching is very important for problems like object detection, in which inference is very expensive (seconds or minutes per image). Consider for example the [Felzenszwalb et al., 2008] object detector: with several thousand training images and five seconds per image for inference, a single round of augmented inference already takes hours! Thus the solver should be iterated until the examples in the cache are correctly separated: it is pointless to fetch more constraints before the solution has stabilised, given the huge cost.

Preventive caching: during a round of inference it is also possible to return and store in the cache a small set of highly violated constraints. They may become the most violated ones at a later iteration.

Tricks of the trade: incremental training

Another speedup is to train the model gradually, by adding progressively more training samples. The intuition is that a lot of the samples are only needed to refine the model.

Bibliography

P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. CVPR, 2008.

T. Joachims. Training linear SVMs in linear time. In Proc. KDD, 2006.

T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.

K. C. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46, 1990.

C. Lemaréchal, A. Nemirovskii, and Y. Nesterov. New variants of bundle methods. Mathematical Programming, 69, 1995.

A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3), 2004.

C. H. Teo, S. V. N. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 2010.
