Foundations of Machine Learning: On-Line Learning
Mehryar Mohri, Courant Institute and Google Research
Motivation
PAC learning:
- distribution fixed over time (training and test).
- IID assumption.
On-line learning:
- no distributional assumption.
- worst-case analysis (adversarial).
- mixed training and test.
- performance measure: mistake model, regret.

This Lecture
- Prediction with expert advice
- Linear classification
General On-Line Setting
For $t = 1$ to $T$ do
    receive instance $x_t \in X$.
    predict $\hat{y}_t \in Y$.
    receive label $y_t \in Y$.
    incur loss $L(\hat{y}_t, y_t)$.
Classification: $Y = \{0, 1\}$, $L(y, y') = |y' - y|$.
Regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.
Objective: minimize the total loss $\sum_{t=1}^{T} L(\hat{y}_t, y_t)$.
Prediction with Expert Advice
For $t = 1$ to $T$ do
    receive instance $x_t \in X$ and advice $y_{t,i} \in Y$, $i \in [1, N]$.
    predict $\hat{y}_t \in Y$.
    receive label $y_t \in Y$.
    incur loss $L(\hat{y}_t, y_t)$.
Objective: minimize regret, i.e., the difference between the total loss incurred and that of the best expert:
$$\mathrm{Regret}(T) = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(y_{t,i}, y_t).$$
Mistake Bound Model
Definition: the maximum number of mistakes a learning algorithm $L$ makes to learn $c$ is defined by
$$M_L(c) = \max_{x_1, \ldots, x_T} |\mathrm{mistakes}(L, c)|.$$
Definition: for any concept class $C$, the maximum number of mistakes a learning algorithm $L$ makes is
$$M_L(C) = \max_{c \in C} M_L(c).$$
A mistake bound is a bound $M$ on $M_L(C)$.
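Halving Algorithm (see Mitchell, 1997)
Halving(H)
    $H_1 \leftarrow H$
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow$ MajorityVote($H_t, x_t$)
        Receive($y_t$)
        if $\hat{y}_t \neq y_t$ then
            $H_{t+1} \leftarrow \{c \in H_t : c(x_t) = y_t\}$
    return $H_{T+1}$
When no mistake is made, the hypothesis set is left unchanged: $H_{t+1} = H_t$.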
Halving Algorithm - Bound (Littlestone, 1988)
Theorem: Let $H$ be a finite hypothesis set. Then
$$M_{\mathrm{Halving}(H)} \le \log_2 |H|.$$
Proof: At each mistake, the hypothesis set is reduced at least by half.
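The Halving pseudocode and its bound can be tried on a small example. The sketch below is only illustrative: the threshold class `make_threshold`, the set `H` of 17 hypotheses, and the input sequence `xs` are assumptions made for the demo and are not part of the lecture. Ties in the majority vote are broken toward label 1, which is also an arbitrary choice.

```python
import math


def halving(hypotheses, stream):
    """Run Halving on (x, y) pairs; return (mistakes, surviving hypotheses)."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        # Majority vote of the current hypothesis set (ties broken toward 1).
        ones = sum(h(x) for h in version_space)
        y_hat = 1 if 2 * ones >= len(version_space) else 0
        if y_hat != y:
            mistakes += 1
            # Keep only the hypotheses that agree with the revealed label.
            version_space = [h for h in version_space if h(x) == y]
    return mistakes, version_space


def make_threshold(k):
    """Hypothetical toy concept: h_k(x) = 1 if x >= k, else 0."""
    return lambda x: 1 if x >= k else 0


H = [make_threshold(k) for k in range(17)]   # |H| = 17
target = H[11]                               # realizable case: the target is in H
xs = [8, 4, 12, 10, 11, 0, 15, 9, 13, 11]    # arbitrary instance sequence
stream = [(x, target(x)) for x in xs]

mistakes, remaining = halving(H, stream)
print(mistakes, "mistakes; bound log2|H| =", math.log2(len(H)))
```

Because the target is never eliminated in the realizable case, it is always among the surviving hypotheses, and every mistake removes at least half of the set.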
VC Dimension Lower Bound (Littlestone, 1988)
Theorem: Let $\mathrm{opt}(H)$ be the optimal mistake bound for $H$. Then,
$$\mathrm{VCdim}(H) \le \mathrm{opt}(H) \le M_{\mathrm{Halving}(H)} \le \log_2 |H|.$$
Proof: for a fully shattered set, form a complete binary tree of the mistakes with height $\mathrm{VCdim}(H)$.
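Weighted Majority Algorithm (Littlestone and Warmuth, 1988)
Here $y_t, y_{t,i} \in \{0, 1\}$ and $\beta \in [0, 1)$.
Weighted-Majority(N experts)
    for $i \leftarrow 1$ to $N$ do
        $w_{1,i} \leftarrow 1$
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow 1_{\sum_{i : y_{t,i} = 1} w_{t,i} \ge \sum_{i : y_{t,i} = 0} w_{t,i}}$    (weighted majority vote)
        Receive($y_t$)
        if $\hat{y}_t \neq y_t$ then
            for $i \leftarrow 1$ to $N$ do
                if ($y_{t,i} \neq y_t$) then
                    $w_{t+1,i} \leftarrow \beta\, w_{t,i}$
                else $w_{t+1,i} \leftarrow w_{t,i}$
    return $w_{T+1}$
When no mistake is made, the weights are left unchanged: $w_{t+1} = w_t$.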
Weighted Majority - Bound
Theorem: Let $m_t$ be the number of mistakes made by the WM algorithm till time $t$ and $m_t^*$ that of the best expert. Then, for all $t$,
$$m_t \le \frac{\log N + m_t^* \log \frac{1}{\beta}}{\log \frac{2}{1 + \beta}}.$$
Thus, $m_t \le O(\log N) + \text{constant} \times |\text{mistakes of best expert}|$.
Realizable case: $m_t \le O(\log N)$.
Halving algorithm: $\beta = 0$.
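Weighted Majority - Proof
Potential: $\Phi_t = \sum_{i=1}^{N} w_{t,i}$.
Upper bound: after each error,
$$\Phi_{t+1} \le \Big[\tfrac{1}{2} + \tfrac{1}{2}\beta\Big] \Phi_t = \Big[\frac{1 + \beta}{2}\Big] \Phi_t.$$
Thus, $\Phi_t \le \big[\frac{1 + \beta}{2}\big]^{m_t} N$.
Lower bound: for any expert $i$, $\Phi_t \ge w_{t,i} = \beta^{m_{t,i}}$.
Comparison:
$$\beta^{m_t^*} \le \Big[\frac{1 + \beta}{2}\Big]^{m_t} N
\;\Rightarrow\; m_t^* \log \beta \le \log N + m_t \log \Big[\frac{1 + \beta}{2}\Big]
\;\Rightarrow\; m_t \log \Big[\frac{2}{1 + \beta}\Big] \le \log N + m_t^* \log \frac{1}{\beta}.$$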
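A minimal sketch of the Weighted Majority update and of its mistake bound follows. The label sequence and the four experts produced by `advice` are made up for illustration, and `wm_bound` simply evaluates the right-hand side of the theorem (it requires $0 < \beta < 1$, since $\log(1/\beta)$ is infinite at $\beta = 0$).

```python
import math


def weighted_majority(expert_preds, labels, beta=0.5):
    """Deterministic Weighted Majority with binary labels in {0, 1}."""
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        weight_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        weight_zero = sum(wi for wi, p in zip(w, preds) if p == 0)
        y_hat = 1 if weight_one >= weight_zero else 0
        if y_hat != y:
            mistakes += 1
            # Shrink the weight of every expert that was wrong on this round.
            w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]
    return mistakes, w


def wm_bound(n_experts, best_mistakes, beta):
    """Right-hand side of the WM mistake bound (needs 0 < beta < 1)."""
    return (math.log(n_experts) + best_mistakes * math.log(1 / beta)) / math.log(2 / (1 + beta))


def advice(t, y):
    """Hypothetical experts: always 0, always 1, mostly right, always wrong."""
    return [0, 1, y if t % 4 else 1 - y, 1 - y]


labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
expert_preds = [advice(t, y) for t, y in enumerate(labels)]
best = min(sum(p[i] != y for p, y in zip(expert_preds, labels)) for i in range(4))

mistakes, weights = weighted_majority(expert_preds, labels, beta=0.5)
print(mistakes, "mistakes; best expert:", best, "; bound:", wm_bound(4, best, 0.5))
```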
Weighted Majority - Notes
Advantage: remarkable bound requiring no assumption.
Disadvantage: no deterministic algorithm can achieve a regret $R_T = o(T)$ with the binary loss.
- better guarantee with randomized WM.
- better guarantee for WM with convex losses.
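Exponential Weighted Average
Algorithm:
- weight update: $w_{t+1,i} \leftarrow w_{t,i}\, e^{-\eta L(y_{t,i}, y_t)} = e^{-\eta L_{t,i}}$, where $L_{t,i}$ is the total loss incurred by expert $i$ up to time $t$.
- prediction: $\hat{y}_t = \dfrac{\sum_{i=1}^{N} w_{t,i}\, y_{t,i}}{\sum_{i=1}^{N} w_{t,i}}$.
Theorem: assume that $L$ is convex in its first argument and takes values in $[0, 1]$. Then, for any $\eta > 0$ and any sequence $y_1, \ldots, y_T \in Y$, the regret at $T$ satisfies
$$\mathrm{Regret}(T) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
For $\eta = \sqrt{8 \log N / T}$,
$$\mathrm{Regret}(T) \le \sqrt{(T/2) \log N}.$$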
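Exponential Weighted Avg - Proof
Potential: $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$. In this proof the weights are indexed so that $w_{t,i} = e^{-\eta L_{t,i}}$, with $w_{0,i} = 1$, and $\mathrm{E}_{w_{t-1}}$ denotes the expectation over $i$ drawn with probability proportional to $w_{t-1,i}$.
Upper bound:
$$\begin{aligned}
\Phi_t - \Phi_{t-1}
&= \log \frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(y_{t,i}, y_t)}}{\sum_{i=1}^{N} w_{t-1,i}} \\
&= \log \mathrm{E}_{w_{t-1}}\big[e^{-\eta L(y_{t,i}, y_t)}\big] \\
&= \log \mathrm{E}_{w_{t-1}}\Big[\exp\Big(-\eta \big(L(y_{t,i}, y_t) - \mathrm{E}_{w_{t-1}}[L(y_{t,i}, y_t)]\big) - \eta\, \mathrm{E}_{w_{t-1}}[L(y_{t,i}, y_t)]\Big)\Big] \\
&\le -\eta\, \mathrm{E}_{w_{t-1}}[L(y_{t,i}, y_t)] + \frac{\eta^2}{8} && \text{(Hoeffding's ineq.)} \\
&\le -\eta\, L\big(\mathrm{E}_{w_{t-1}}[y_{t,i}], y_t\big) + \frac{\eta^2}{8} && \text{(convexity of first arg. of } L\text{)} \\
&= -\eta\, L(\hat{y}_t, y_t) + \frac{\eta^2}{8}.
\end{aligned}$$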
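Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
$$\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}.$$
Lower bound:
$$\Phi_T - \Phi_0 = \log \sum_{i=1}^{N} e^{-\eta L_{T,i}} - \log N \ge \log \max_{i=1}^{N} e^{-\eta L_{T,i}} - \log N = -\eta \min_{i=1}^{N} L_{T,i} - \log N.$$
Comparison:
$$-\eta \min_{i=1}^{N} L_{T,i} - \log N \le -\eta \sum_{t=1}^{T} L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}
\;\Rightarrow\; \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i=1}^{N} L_{T,i} \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$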
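The regret guarantee can be checked numerically on a toy run. The sketch below assumes the absolute loss $L(y, y') = |y' - y|$ on $[0, 1]$, which is convex in its first argument and bounded by 1; the expert predictions and labels are random placeholders generated with a fixed seed, not data from the lecture.

```python
import math
import random


def absolute_loss(y_hat, y):
    """Convex in y_hat and valued in [0, 1] when both arguments lie in [0, 1]."""
    return abs(y_hat - y)


def exponential_weighted_average(expert_preds, labels, eta, loss=absolute_loss):
    """Return (learner's total loss, per-expert cumulative losses L_{T,i})."""
    n = len(expert_preds[0])
    cum_loss = [0.0] * n
    total = 0.0
    for preds, y in zip(expert_preds, labels):
        # Weights depend only on the losses observed before this round.
        weights = [math.exp(-eta * l) for l in cum_loss]
        y_hat = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
        total += loss(y_hat, y)
        cum_loss = [l + loss(p, y) for l, p in zip(cum_loss, preds)]
    return total, cum_loss


rng = random.Random(0)                      # fixed seed, arbitrary toy data
T, N = 100, 5
labels = [rng.randint(0, 1) for _ in range(T)]
expert_preds = [[rng.random() for _ in range(N)] for _ in range(T)]

eta = math.sqrt(8 * math.log(N) / T)        # tuning from the theorem
total, cum_loss = exponential_weighted_average(expert_preds, labels, eta)
regret = total - min(cum_loss)
print("regret:", regret, "bound:", math.sqrt(T / 2 * math.log(N)))
```

The bound holds for any sequence, so the check does not depend on the particular seed.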
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the form
$$\frac{R_T}{T} = O\Big(\sqrt{\frac{\log(N)}{T}}\Big).$$
Disadvantage: the choice of $\eta$ requires knowledge of the horizon $T$.
Doubling Trick
Idea: divide time into periods $[2^k, 2^{k+1} - 1]$ of length $2^k$ with $k = 0, \ldots, n$, $T \ge 2^n - 1$, and choose $\eta_k = \sqrt{\dfrac{8 \log N}{2^k}}$ in each period.
Theorem: with the same assumptions as before, for any $T$, the following holds:
$$\mathrm{Regret}(T) \le \frac{\sqrt{2}}{\sqrt{2} - 1} \sqrt{(T/2) \log N} + \sqrt{\log N / 2}.$$
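Doubling Trick - Proof
By the previous theorem, for any $I_k = [2^k, 2^{k+1} - 1]$,
$$L_{I_k} - \min_{i=1}^{N} L_{I_k, i} \le \sqrt{2^k / 2 \, \log N}.$$
Thus,
$$L_T = \sum_{k=0}^{n} L_{I_k} \le \sum_{k=0}^{n} \min_{i=1}^{N} L_{I_k, i} + \sum_{k=0}^{n} \sqrt{2^k (\log N)/2}
\le \min_{i=1}^{N} L_{T,i} + \sum_{k=0}^{n} 2^{k/2} \sqrt{(\log N)/2},$$
with
$$\sum_{k=0}^{n} 2^{k/2} = \frac{\sqrt{2}^{\,n+1} - 1}{\sqrt{2} - 1} = \frac{2^{(n+1)/2} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}\sqrt{T + 1} - 1}{\sqrt{2} - 1} \le \frac{\sqrt{2}(\sqrt{T} + 1) - 1}{\sqrt{2} - 1} = \frac{\sqrt{2}\sqrt{T}}{\sqrt{2} - 1} + 1.$$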
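The doubling trick can be sketched by restarting the exponential weighted average at the start of each period with its own $\eta_k$. As in any such illustration, the loss (absolute loss on $[0, 1]$), the helper name `ewa_doubling`, and the seeded random data are assumptions chosen for the demo; the last period is simply truncated at the horizon.

```python
import math
import random


def absolute_loss(y_hat, y):
    return abs(y_hat - y)


def ewa_doubling(expert_preds, labels, loss=absolute_loss):
    """Exponential weighted average restarted on periods of length 2^k."""
    n, horizon = len(expert_preds[0]), len(labels)
    total, start, k = 0.0, 0, 0
    while start < horizon:
        eta = math.sqrt(8 * math.log(n) / 2 ** k)   # eta_k for period k
        cum_loss = [0.0] * n                        # weights restart each period
        for t in range(start, min(start + 2 ** k, horizon)):
            weights = [math.exp(-eta * l) for l in cum_loss]
            y_hat = sum(w * p for w, p in zip(weights, expert_preds[t])) / sum(weights)
            total += loss(y_hat, labels[t])
            cum_loss = [l + loss(p, labels[t]) for l, p in zip(cum_loss, expert_preds[t])]
        start += 2 ** k
        k += 1
    return total


rng = random.Random(1)                      # fixed seed, arbitrary toy data
T, N = 150, 4
labels = [rng.randint(0, 1) for _ in range(T)]
expert_preds = [[rng.random() for _ in range(N)] for _ in range(T)]

total = ewa_doubling(expert_preds, labels)
best = min(sum(absolute_loss(p[i], y) for p, y in zip(expert_preds, labels)) for i in range(N))
regret = total - best
bound = math.sqrt(2) / (math.sqrt(2) - 1) * math.sqrt(T / 2 * math.log(N)) + math.sqrt(math.log(N) / 2)
print("regret:", regret, "bound:", bound)
```

No knowledge of $T$ is used inside `ewa_doubling`; the horizon only determines where the loop stops.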
Notes
Doubling trick used in a variety of other contexts and proofs.
More general method, learning parameter function of time: $\eta_t = \sqrt{(8 \log N)/t}$. Constant factor improvement:
$$\mathrm{Regret}(T) \le 2 \sqrt{(T/2) \log N} + \sqrt{(1/8) \log N}.$$

This Lecture
- Prediction with expert advice
- Linear classification
Perceptron Algorithm (Rosenblatt, 1958)
Perceptron($w_0$)
    $w_1 \leftarrow w_0$    (typically $w_0 = 0$)
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}(w_t \cdot x_t)$
        Receive($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $w_{t+1} \leftarrow w_t + y_t x_t$    (more generally $\eta\, y_t x_t$, $\eta > 0$)
        else $w_{t+1} \leftarrow w_t$
    return $w_{T+1}$
Separating Hyperplane
[Figure: margin and errors. Two panels show the hyperplane $w \cdot x = 0$, the margin $\rho$, and the quantity $y_i (w \cdot x_i) / \|w\|$.]
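Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable.
$$F(w) = \frac{1}{T} \sum_{t=1}^{T} \max\big(0, -y_t (w \cdot x_t)\big) = \mathrm{E}_{x \sim \widehat{D}}[f(w, x)], \quad \text{with } f(w, x) = \max\big(0, -y (w \cdot x)\big).$$
Stochastic gradient: for each $x_t$, the update is
$$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w f(w_t, x_t) & \text{if differentiable} \\ w_t & \text{otherwise,} \end{cases}$$
where $\eta > 0$ is a learning rate parameter.
Here:
$$w_{t+1} \leftarrow \begin{cases} w_t + \eta\, y_t x_t & \text{if } y_t (w_t \cdot x_t) < 0 \\ w_t & \text{otherwise.} \end{cases}$$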
Perceptron Algorithm - Bound (Novikoff, 1962)
Theorem: Assume that $\|x_t\| \le R$ for all $t \in [1, T]$ and that for some $\rho > 0$ and $v \in \mathbb{R}^N$, for all $t \in [1, T]$,
$$\rho \le \frac{y_t (v \cdot x_t)}{\|v\|}.$$
Then, the number of mistakes made by the perceptron algorithm is bounded by $R^2 / \rho^2$.
Proof: Let $I$ be the set of $t$s at which there is an update and let $M$ be the total number of updates.
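Summing up the assumption inequalities gives (with $w_0 = 0$):
$$\begin{aligned}
M \rho &\le \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|} \\
&= \frac{v \cdot \sum_{t \in I} (w_{t+1} - w_t)}{\|v\|} && \text{(definition of updates)} \\
&= \frac{v \cdot w_{T+1}}{\|v\|} \\
&\le \|w_{T+1}\| && \text{(Cauchy-Schwarz ineq.)} \\
&= \|w_{t_m} + y_{t_m} x_{t_m}\| && (t_m \text{ largest } t \text{ in } I) \\
&= \big[\|w_{t_m}\|^2 + \|x_{t_m}\|^2 + 2\, y_{t_m} w_{t_m} \cdot x_{t_m}\big]^{1/2} \\
&\le \big[\|w_{t_m}\|^2 + R^2\big]^{1/2} && (y_{t_m} w_{t_m} \cdot x_{t_m} \le 0 \text{ at an update}) \\
&\le \big[M R^2\big]^{1/2} = \sqrt{M} R && \text{(applying the same to previous } t\text{s in } I\text{)}.
\end{aligned}$$
Dividing $M \rho \le \sqrt{M} R$ by $\sqrt{M} \rho$ gives $\sqrt{M} \le R / \rho$, that is $M \le R^2 / \rho^2$.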
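A small sketch of the perceptron and of the Novikoff bound on a separable toy problem. The six points, the reference vector `v = (1, 1)`, and the number of passes are assumptions for illustration. A zero score is treated as a mistake here, so that every update satisfies $y_t (w_t \cdot x_t) \le 0$ as the proof requires.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(stream, dim, eta=1.0):
    """Online perceptron started at w = 0; returns (number of updates, final w)."""
    w = [0.0] * dim
    updates = 0
    for x, y in stream:
        # Update on a mistake; a zero score counts as a mistake in this sketch.
        if y * dot(w, x) <= 0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            updates += 1
    return updates, w


# Hypothetical linearly separable sample in R^2 with labels in {-1, +1}.
points = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1),
          ((-3.0, -1.0), -1), ((2.0, -1.0), 1), ((-1.5, 0.5), -1)]
v = (1.0, 1.0)                                   # a separating direction
rho = min(y * dot(v, x) for x, y in points) / math.sqrt(dot(v, v))
R = max(math.sqrt(dot(x, x)) for x, _ in points)
bound = R ** 2 / rho ** 2

updates, w = perceptron(points * 30, dim=2)      # 30 passes over the sample
print(updates, "updates; bound R^2/rho^2 =", bound)
```

Since the total number of updates is bounded, cycling over a separable sample eventually produces a pass with no update, after which the weight vector classifies every point correctly.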
Notes:
- bound independent of dimension and tight.
- convergence can be slow for small margin, it can be in $\Omega(2^N)$.
- among the many variants: voted perceptron algorithm. Predict according to
$$\mathrm{sgn}\Big(\Big(\sum_{t \in I} c_t w_t\Big) \cdot x\Big),$$
where $c_t$ is the number of iterations $w_t$ survives.
- $\{x_t : t \in I\}$ are the support vectors for the perceptron algorithm.
- non-separable case: does not converge.
Perceptron - Leave-One-Out Analysis
Theorem: Let $h_S$ be the hypothesis returned by the perceptron algorithm for sample $S = (x_1, \ldots, x_T)$ drawn according to $D$ and let $M(S)$ be the number of updates defining $h_S$. Then,
$$\mathrm{E}_{S \sim D^m}[R(h_S)] \le \mathrm{E}_{S \sim D^{m+1}}\bigg[\frac{\min\big(M(S), R_{m+1}^2 / \rho_{m+1}^2\big)}{m + 1}\bigg].$$
Proof: Let $S \sim D^{m+1}$ be a linearly separable sample and let $x \in S$. If $h_{S - \{x\}}$ misclassifies $x$, then $x$ must be a support vector for $h_S$ (update at $x$). Thus,
$$\widehat{R}_{\mathrm{loo}}(\text{perceptron}) \le \frac{M(S)}{m + 1}.$$
SVMs - Leave-One-Out Analysis
Theorem: let $h_S$ be the optimal hyperplane for a sample $S$ and let $N_{SV}(S)$ be the number of support vectors defining $h_S$. Then,
$$\mathrm{E}_{S \sim D^m}[R(h_S)] \le \mathrm{E}_{S \sim D^{m+1}}\bigg[\frac{\min\big(N_{SV}(S), R_{m+1}^2 / \rho_{m+1}^2\big)}{m + 1}\bigg].$$
Proof: one part proven in lecture 4. The other part is due to $\alpha_i \ge 1 / R_{m+1}^2$ for $x_i$ misclassified by SVMs (Vapnik, 1995).
Comparison
Bounds on expected error, not high probability statements.
Leave-one-out bounds not sufficient to distinguish SVMs and perceptron algorithm. Note however:
- same maximum margin $\rho_{m+1}$ can be used in both.
- but different radius $R_{m+1}$ of support vectors.
Difference: margin distribution.
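Non-Separable Case - L1 Bound (MM and Rostamizadeh, 2013)
Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$ and let $M_T = |I|$. Then,
$$M_T \le \inf_{\rho > 0, \|u\|_2 \le 1} \bigg[\sum_{t \in I} \Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+ + \frac{\sqrt{\sum_{t \in I} \|x_t\|^2}}{\rho}\bigg].$$
When $\|x_t\| \le R$ for all $t \in I$, this implies
$$M_T \le \inf_{\rho > 0, \|u\|_2 \le 1} \Big(\frac{R}{\rho} + \sqrt{\|L_\rho(u)\|_1}\Big)^2,$$
where $L_\rho(u) = \Big[\Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+\Big]_{t \in I}$.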
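Proof: for any $t$, $1 - \frac{y_t (u \cdot x_t)}{\rho} \le \Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+$. Summing up these inequalities for $t \in I$ yields:
$$M_T \le \sum_{t \in I} \Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+ + \sum_{t \in I} \frac{y_t (u \cdot x_t)}{\rho}.$$
Upper-bounding $\sum_{t \in I} y_t (u \cdot x_t)$ as in the proof for the separable case shows the first inequality.
The second inequality is obtained by solving
$$M_T \le \|L_\rho(u)\|_1 + \frac{R}{\rho} \sqrt{M_T},$$
which gives
$$\sqrt{M_T} \le \frac{\frac{R}{\rho} + \sqrt{\frac{R^2}{\rho^2} + 4 \|L_\rho(u)\|_1}}{2} \le \frac{R}{\rho} + \sqrt{\|L_\rho(u)\|_1}.$$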
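Non-Separable Case - L2 Bound (Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let $I$ denote the set of rounds at which the Perceptron algorithm makes an update when processing $x_1, \ldots, x_T$ and let $M_T = |I|$. Then,
$$M_T \le \inf_{\rho > 0, \|u\|_2 \le 1} \Bigg[\frac{\|L_\rho(u)\|_2}{2} + \sqrt{\frac{\|L_\rho(u)\|_2^2}{4} + \frac{\sqrt{\sum_{t \in I} \|x_t\|^2}}{\rho}}\,\Bigg]^2.$$
When $\|x_t\| \le R$ for all $t \in I$, this implies
$$M_T \le \inf_{\rho > 0, \|u\|_2 \le 1} \Big(\frac{R}{\rho} + \|L_\rho(u)\|_2\Big)^2,$$
where $L_\rho(u) = \Big[\Big(1 - \frac{y_t (u \cdot x_t)}{\rho}\Big)_+\Big]_{t \in I}$.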
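Proof: Reduce the problem to the separable case in higher dimension. Let $l_t = \Big(1 - \frac{y_t\, u \cdot x_t}{\rho}\Big)_+ 1_{t \in I}$, for $t \in [1, T]$.
Mapping (similar to trivial mapping):
$$x_t = (x_{t,1}, \ldots, x_{t,N})^\top \;\mapsto\; x'_t = (x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \Delta, 0, \ldots, 0)^\top, \quad \Delta \text{ in the } (N + t)\text{th component},$$
$$u \;\mapsto\; u' = \Big(\frac{u_1}{Z}, \ldots, \frac{u_N}{Z}, \frac{y_1 l_1 \rho}{\Delta Z}, \ldots, \frac{y_T l_T \rho}{\Delta Z}\Big)^\top,$$
$$\|u'\| = 1 \;\Rightarrow\; Z = \sqrt{1 + \frac{\rho^2 \|L_\rho(u)\|_2^2}{\Delta^2}}.$$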
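Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing $x'_1, \ldots, x'_T$.
For any $t \in I$,
$$y_t (u' \cdot x'_t) = y_t \Big(\frac{u \cdot x_t}{Z} + \Delta \frac{y_t l_t \rho}{\Delta Z}\Big) = \frac{y_t\, u \cdot x_t}{Z} + \frac{\rho\, l_t}{Z} = \frac{1}{Z}\Big(y_t\, u \cdot x_t + [\rho - y_t (u \cdot x_t)]_+\Big) \ge \frac{\rho}{Z}.$$
Summing up and using the proof in the separable case yields:
$$\frac{M_T\, \rho}{Z} \le \sum_{t \in I} y_t (u' \cdot x'_t) \le \sqrt{\sum_{t \in I} \|x'_t\|^2}.$$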
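The inequality can be rewritten as
$$M_T^2 \le \Big(\frac{1}{\rho^2} + \frac{\|L_\rho(u)\|_2^2}{\Delta^2}\Big)\big(r^2 + M_T \Delta^2\big) = \frac{r^2}{\rho^2} + \frac{r^2 \|L_\rho(u)\|_2^2}{\Delta^2} + \frac{M_T \Delta^2}{\rho^2} + M_T \|L_\rho(u)\|_2^2,$$
where $r = \sqrt{\sum_{t \in I} \|x_t\|^2}$.
Selecting $\Delta$ to minimize the bound gives $\Delta^2 = \dfrac{\rho \|L_\rho(u)\|_2\, r}{\sqrt{M_T}}$ and leads to
$$M_T^2 \le \frac{r^2}{\rho^2} + 2 \sqrt{M_T}\, \|L_\rho(u)\|_2 \frac{r}{\rho} + M_T \|L_\rho(u)\|_2^2 = \Big(\frac{r}{\rho} + \sqrt{M_T}\, \|L_\rho(u)\|_2\Big)^2.$$
Solving the second-degree inequality
$$M_T - \sqrt{M_T}\, \|L_\rho(u)\|_2 - \frac{r}{\rho} \le 0$$
yields directly the first statement. The second one results from replacing $r$ with $\sqrt{M_T}\, R$.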
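Dual Perceptron Algorithm
Dual-Perceptron($\alpha_0$)
    $\alpha \leftarrow \alpha_0$    (typically $\alpha_0 = 0$)
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s (x_s \cdot x_t)\big)$
        Receive($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $\alpha_t \leftarrow \alpha_t + 1$
    return $\alpha$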
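Kernel Perceptron Algorithm (Aizerman et al., 1964)
$K$ PDS kernel.
Kernel-Perceptron($\alpha_0$)
    $\alpha \leftarrow \alpha_0$    (typically $\alpha_0 = 0$)
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s K(x_s, x_t)\big)$
        Receive($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $\alpha_t \leftarrow \alpha_t + 1$
    return $\alpha$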
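The kernel perceptron can be illustrated on the XOR pattern, which is not linearly separable in the input space. This is a sketch under stated assumptions: the degree-2 polynomial kernel `poly_kernel`, the four XOR points, and the three passes over them are choices made for the example. Instead of a full vector $\alpha$, the code stores one $(x_s, y_s)$ pair per update, which corresponds to $\alpha_s = 1$ on update rounds and $\alpha_s = 0$ elsewhere; a zero score is treated as a mistake.

```python
def poly_kernel(a, b):
    """Degree-2 polynomial kernel (1 + a.b)^2, a PDS kernel."""
    return (1.0 + sum(ai * bi for ai, bi in zip(a, b))) ** 2


def kernel_score(support, x, kernel):
    """Dual-form score: sum over stored update rounds of y_s K(x_s, x)."""
    return sum(y_s * kernel(x_s, x) for x_s, y_s in support)


def kernel_perceptron(stream, kernel):
    """Online kernel perceptron; returns (mistakes, list of support pairs)."""
    support = []
    mistakes = 0
    for x, y in stream:
        # A zero score is counted as a mistake in this sketch.
        if y * kernel_score(support, x, kernel) <= 0:
            support.append((x, y))     # same as alpha_t <- alpha_t + 1
            mistakes += 1
    return mistakes, support


# XOR-style labels: y = -x1 * x2, not linearly separable in R^2.
xor_points = [((1.0, 1.0), -1), ((-1.0, -1.0), -1),
              ((1.0, -1.0), 1), ((-1.0, 1.0), 1)]

mistakes, support = kernel_perceptron(xor_points * 3, poly_kernel)
predictions = [1 if kernel_score(support, x, poly_kernel) > 0 else -1
               for x, _ in xor_points]
print("mistakes:", mistakes, "predictions:", predictions)
```

Replacing `poly_kernel` with the plain inner product recovers the dual perceptron of the previous slide.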
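Winnow Algorithm (Littlestone, 1988)
Winnow($\eta$)
    $w_1 \leftarrow 1/N$
    for $t \leftarrow 1$ to $T$ do
        Receive($x_t$)
        $\hat{y}_t \leftarrow \mathrm{sgn}(w_t \cdot x_t)$    ($y_t \in \{-1, +1\}$)
        Receive($y_t$)
        if ($\hat{y}_t \neq y_t$) then
            $Z_t \leftarrow \sum_{i=1}^{N} w_{t,i} \exp(\eta\, y_t x_{t,i})$
            for $i \leftarrow 1$ to $N$ do
                $w_{t+1,i} \leftarrow \dfrac{w_{t,i} \exp(\eta\, y_t x_{t,i})}{Z_t}$
        else $w_{t+1} \leftarrow w_t$
    return $w_{T+1}$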
Notes
Winnow = weighted majority:
- for $y_{t,i} = x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote.
- multiplying by $e^{\eta}$ or $e^{-\eta}$ the weight of correct or incorrect experts is equivalent to multiplying by $\beta = e^{-2\eta}$ the weight of incorrect ones.
Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).
Winnow Algorithm - Bound
Theorem: Assume that $\|x_t\|_\infty \le R_\infty$ for all $t \in [1, T]$ and that for some $\rho_\infty > 0$ and $v \in \mathbb{R}^N$, $v \ge 0$, for all $t \in [1, T]$,
$$\rho_\infty \le \frac{y_t (v \cdot x_t)}{\|v\|_1}.$$
Then, the number of mistakes made by the Winnow algorithm is bounded by $2 (R_\infty^2 / \rho_\infty^2) \log N$.
Proof: Let $I$ be the set of $t$s at which there is an update and let $M$ be the total number of updates.
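Winnow Algorithm - Bound
Potential (relative entropy):
$$\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i / \|v\|_1}{w_{t,i}}.$$
Upper bound: for each $t$ in $I$,
$$\begin{aligned}
\Phi_{t+1} - \Phi_t &= \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}} \\
&= \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta\, y_t x_{t,i})} \\
&= \log Z_t - \eta \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} y_t x_{t,i} \\
&\le \log \Big[\sum_{i=1}^{N} w_{t,i} \exp(\eta\, y_t x_{t,i})\Big] - \eta \rho_\infty \\
&= \log \mathrm{E}_{w_t}\big[\exp(\eta\, y_t x_t)\big] - \eta \rho_\infty \\
&\le \log \big[\exp(\eta^2 (2 R_\infty)^2 / 8)\big] + \eta\, y_t\, w_t \cdot x_t - \eta \rho_\infty && \text{(Hoeffding)} \\
&\le \eta^2 R_\infty^2 / 2 - \eta \rho_\infty && (y_t\, w_t \cdot x_t \le 0 \text{ at an update}).
\end{aligned}$$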
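Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
$$\Phi_{T+1} - \Phi_1 \le M \big(\eta^2 R_\infty^2 / 2 - \eta \rho_\infty\big).$$
Lower bound: note that
$$\Phi_1 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i / \|v\|_1}{1/N} = \log N + \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \le \log N,$$
and for all $t$, $\Phi_t \ge 0$ (property of relative entropy). Thus,
$$\Phi_{T+1} - \Phi_1 \ge 0 - \log N = -\log N.$$
Comparison: $-\log N \le M \big(\eta^2 R_\infty^2 / 2 - \eta \rho_\infty\big)$. For $\eta = \dfrac{\rho_\infty}{R_\infty^2}$, we obtain
$$M \le 2 \log N\, \frac{R_\infty^2}{\rho_\infty^2}.$$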
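A short sketch of Winnow on a sparse target illustrates the $2 (R_\infty^2 / \rho_\infty^2) \log N$ bound. The setup is an assumption made for the example: instances are random vectors in $\{\pm 1\}^N$ generated with a fixed seed, and the label is the first coordinate, so that $v = e_1$, $R_\infty = 1$, $\rho_\infty = 1$, and the tuning from the proof gives $\eta = 1$. A zero score is treated as a mistake, which keeps $y_t (w_t \cdot x_t) \le 0$ at every update.

```python
import math
import random


def winnow(stream, dim, eta):
    """Winnow with normalized multiplicative updates; returns (updates, w)."""
    w = [1.0 / dim] * dim
    updates = 0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        # Update on a mistake; a zero score counts as a mistake in this sketch.
        if y * score <= 0:
            unnormalized = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
            z = sum(unnormalized)                  # normalization factor Z_t
            w = [u / z for u in unnormalized]
            updates += 1
    return updates, w


rng = random.Random(0)                             # fixed seed, toy data
N, T = 16, 200
xs = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(T)]
stream = [(x, x[0]) for x in xs]                   # label = first coordinate

R_inf, rho_inf = 1.0, 1.0
eta = rho_inf / R_inf ** 2
updates, w = winnow(stream, N, eta)
bound = 2 * (R_inf ** 2 / rho_inf ** 2) * math.log(N)
print(updates, "updates; bound:", bound)
```

The weight vector stays on the probability simplex throughout, which is what makes the relative-entropy potential well defined.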
Notes
Comparison with perceptron bound:
- dual norms: norms for $x_t$ and $v$.
- similar bounds with different norms.
- each advantageous in different cases:
    - Winnow bound favorable when a sparse set of experts can predict well. For example, if $v = e_1$ and $x_t \in \{\pm 1\}^N$, $\log N$ vs $N$.
    - Perceptron favorable in opposite situation.
Conclusion
On-line learning:
- wide and fast-growing literature.
- many related topics, e.g., game theory, text compression, convex optimization.
- online to batch bounds and techniques.
- online version of batch algorithms, e.g., regression algorithms (see regression lecture).
References
Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the Generalization Ability of On-Line Learning Algorithms. IEEE Transactions on Information Theory, 50(9).
Cesa-Bianchi, N., and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press.
Freund, Y., and Schapire, R. Large margin classification using the perceptron algorithm. In Proceedings of COLT. ACM Press.
Littlestone, N. From On-Line to Batch Learning. COLT 1989.
Littlestone, N. Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning (2).
Littlestone, N., and Warmuth, M. K. The Weighted Majority Algorithm. FOCS 1989.
Mitchell, T. Machine Learning. McGraw Hill.
Mohri, M., and Rostamizadeh, A. Perceptron Mistake Bounds. arXiv.
Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12. Polytechnic Institute of Brooklyn.
Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Cornell Aeronautical Laboratory, Psychological Review, 65(6).
Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience, New York.
1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationOnline Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes
Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz In this lecture we describe a different model of learning which is called online
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationA Potential-based Framework for Online Multi-class Learning with Partial Feedback
A Potential-based Framework for Online Multi-class Learning with Partial Feedback Shijun Wang Rong Jin Hamed Valizadegan Radiology and Imaging Sciences Computer Science and Engineering Computer Science
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationOnline Algorithms: Learning & Optimization with No Regret.
Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the
More informationTable 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationTrading regret rate for computational efficiency in online learning with limited feedback
Trading regret rate for computational efficiency in online learning with limited feedback Shai Shalev-Shwartz TTI-C Hebrew University On-line Learning with Limited Feedback Workshop, 2009 June 2009 Shai
More informationOnline Classification on a Budget
Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk
More informationNotes from Week 1: Algorithms for sequential prediction
CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationDUOL: A Double Updating Approach for Online Learning
: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University
More information4.1 Introduction - Online Learning Model
Computational Learning Foundations Fall semester, 2010 Lecture 4: November 7, 2010 Lecturer: Yishay Mansour Scribes: Elad Liebman, Yuval Rochman & Allon Wagner 1 4.1 Introduction - Online Learning Model
More informationOnline Convex Optimization
E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting
More informationHow to Use Expert Advice
NICOLÒ CESA-BIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California
More informationOnline Learning with Switching Costs and Other Adaptive Adversaries
Online Learning with Switching Costs and Other Adaptive Adversaries Nicolò Cesa-Bianchi Università degli Studi di Milano Italy Ofer Dekel Microsoft Research USA Ohad Shamir Microsoft Research and the Weizmann
More informationThe p-norm generalization of the LMS algorithm for adaptive filtering
The p-norm generalization of the LMS algorithm for adaptive filtering Jyrki Kivinen University of Helsinki Manfred Warmuth University of California, Santa Cruz Babak Hassibi California Institute of Technology
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationOn Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of
More informationInteractive Machine Learning. Maria-Florina Balcan
Interactive Machine Learning Maria-Florina Balcan Machine Learning Image Classification Document Categorization Speech Recognition Protein Classification Branch Prediction Fraud Detection Spam Detection
More informationKERNEL methods have proven to be successful in many
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 8, AUGUST 2004 2165 Online Learning with Kernels Jyrki Kivinen, Alexer J. Smola, Robert C. Williamson, Member, IEEE Abstract Kernel-based algorithms
More informationA Drifting-Games Analysis for Online Learning and Applications to Boosting
A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire
More informationSupport Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
More informationLogistic Regression for Spam Filtering
Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used
More informationCS229T/STAT231: Statistical Learning Theory (Winter 2015)
CS229T/STAT231: Statistical Learning Theory (Winter 2015) Percy Liang Last updated Wed Oct 14 2015 20:32 These lecture notes will be updated periodically as the course goes on. Please let us know if you
More informationMonotone multi-armed bandit allocations
JMLR: Workshop and Conference Proceedings 19 (2011) 829 833 24th Annual Conference on Learning Theory Monotone multi-armed bandit allocations Aleksandrs Slivkins Microsoft Research Silicon Valley, Mountain
More information17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
More informationBig Data - Lecture 1 Optimization reminders
Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
More informationOnline Learning, Stability, and Stochastic Gradient Descent
Online Learning, Stability, and Stochastic Gradient Descent arxiv:1105.4701v3 [cs.lg] 8 Sep 2011 September 9, 2011 Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco CBCL, McGovern Institute, CSAIL, Brain
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationThe Online Set Cover Problem
The Online Set Cover Problem Noga Alon Baruch Awerbuch Yossi Azar Niv Buchbinder Joseph Seffi Naor ABSTRACT Let X = {, 2,..., n} be a ground set of n elements, and let S be a family of subsets of X, S
More informationThe Set Covering Machine
Journal of Machine Learning Research 3 (2002) 723-746 Submitted 12/01; Published 12/02 The Set Covering Machine Mario Marchand School of Information Technology and Engineering University of Ottawa Ottawa,
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationAchieving All with No Parameters: AdaNormalHedge
JMLR: Workshop and Conference Proceedings vol 40:1 19, 2015 Achieving All with No Parameters: AdaNormalHedge Haipeng Luo Department of Computer Science, Princeton University Robert E. Schapire Microsoft
More informationThe Advantages and Disadvantages of Online Linear Optimization
LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA E-MAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA
More information1 Portfolio Selection
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture # Scribe: Nadia Heninger April 8, 008 Portfolio Selection Last time we discussed our model of the stock market N stocks start on day with
More informationStrongly Adaptive Online Learning
Amit Daniely Alon Gonen Shai Shalev-Shwartz The Hebrew University AMIT.DANIELY@MAIL.HUJI.AC.IL ALONGNN@CS.HUJI.AC.IL SHAIS@CS.HUJI.AC.IL Abstract Strongly adaptive algorithms are algorithms whose performance
More informationFollow the Leader with Dropout Perturbations
JMLR: Worshop and Conference Proceedings vol 35: 26, 204 Follow the Leader with Dropout Perturbations Tim van Erven Département de Mathématiques, Université Paris-Sud, France Wojciech Kotłowsi Institute
More informationPractical Online Active Learning for Classification
Practical Online Active Learning for Classification Claire Monteleoni Department of Computer Science and Engineering University of California, San Diego cmontel@cs.ucsd.edu Matti Kääriäinen Department
More informationList of Publications by Claudio Gentile
List of Publications by Claudio Gentile Claudio Gentile DiSTA, University of Insubria, Italy claudio.gentile@uninsubria.it November 6, 2013 Abstract Contains the list of publications by Claudio Gentile,
More informationGI01/M055 Supervised Learning Proximal Methods
GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationLecture 6: Logistic Regression
Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationHow To Learn From Noisy Distributions On Infinite Dimensional Spaces
Learning Kernel Perceptrons on Noisy Data using Random Projections Guillaume Stempfel, Liva Ralaivola Laboratoire d Informatique Fondamentale de Marseille, UMR CNRS 6166 Université de Provence, 39, rue
More informationOnline Passive-Aggressive Algorithms
Journal of Machine Learning Research 7 2006) 551 585 Submitted 5/05; Published 3/06 Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Joseph Keshet Shai Shalev-Shwartz Yoram Singer School of
More informationHow To Train A Classifier With Active Learning In Spam Filtering
Online Active Learning Methods for Fast Label-Efficient Spam Filtering D. Sculley Department of Computer Science Tufts University, Medford, MA USA dsculley@cs.tufts.edu ABSTRACT Active learning methods
More informationOnline Semi-Supervised Learning
Online Semi-Supervised Learning Andrew B. Goldberg, Ming Li, Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences University of Wisconsin Madison Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationIntroduction to Online Optimization
Princeton University - Department of Operations Research and Financial Engineering Introduction to Online Optimization Sébastien Bubeck December 14, 2011 1 Contents Chapter 1. Introduction 5 1.1. Statistical
More information4 Learning, Regret minimization, and Equilibria
4 Learning, Regret minimization, and Equilibria A. Blum and Y. Mansour Abstract Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive
More informationRevenue Optimization against Strategic Buyers
Revenue Optimization against Strategic Buyers Mehryar Mohri Courant Institute of Mathematical Sciences 251 Mercer Street New York, NY, 10012 Andrés Muñoz Medina Google Research 111 8th Avenue New York,
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationOnline Learning: Theory, Algorithms, and Applications. Thesis submitted for the degree of Doctor of Philosophy by Shai Shalev-Shwartz
Online Learning: Theory, Algorithms, and Applications Thesis submitted for the degree of Doctor of Philosophy by Shai Shalev-Shwartz Submitted to the Senate of the Hebrew University July 2007 This work
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationSupport Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced
More informationOptimal Strategies and Minimax Lower Bounds for Online Convex Games
Optimal Strategies and Minimax Lower Bounds for Online Convex Games Jacob Abernethy UC Berkeley jake@csberkeleyedu Alexander Rakhlin UC Berkeley rakhlin@csberkeleyedu Abstract A number of learning problems
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear
2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
An Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Mind the Duality Gap: Logarithmic regret algorithms for online optimization
Mind the Duality Gap: Logarithmic regret algorithms for online optimization Sham M. Kakade Toyota Technological Institute at Chicago sham@tti-c.org Shai Shalev-Shwartz Toyota Technological Institute at
These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
A fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
Universal Algorithm for Trading in Stock Market Based on the Method of Calibration
Universal Algorithm for Trading in Stock Market Based on the Method of Calibration Vladimir V'yugin Institute for Information Transmission Problems, Russian Academy of Sciences, Bol'shoi Karetnyi per.
Machine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer: Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
Mutual Online Concept Learning for Multiple Agents
Mutual Online Concept Learning for Multiple Agents Jun Wang Les Gasser Graduate School of Library and Information Science University of Illinois at Urbana-Champaign Champaign, IL 61820, USA {junwang4,
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1
Online Feature Selection for Mining Big Data
Online Feature Selection for Mining Big Data Steven C.H. Hoi, Jialei Wang, Peilin Zhao, Rong Jin School of Computer Engineering, Nanyang Technological University, Singapore Department of Computer Science
CSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
The Multiplicative Weights Update method
Chapter 2 The Multiplicative Weights Update method The Multiplicative Weights method is a simple idea which has been repeatedly discovered in fields as diverse as Machine Learning, Optimization, and Game
Linear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
FilterBoost: Regression and Classification on Large Datasets
FilterBoost: Regression and Classification on Large Datasets Joseph K. Bradley Machine Learning Department Carnegie Mellon University Pittsburgh, PA 523 jkbradle@cs.cmu.edu Robert E. Schapire Department
The Goldberg Rao Algorithm for the Maximum Flow Problem
The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }
Gambling and Data Compression
Gambling and Data Compression Gambling. Horse Race Definition The wealth relative S(X) = b(x)o(x) is the factor by which the gambler's wealth grows if horse X wins the race, where b(x) is the fraction
Online Learning and Online Convex Optimization. Contents
Foundations and Trends® in Machine Learning Vol. 4, No. 2 (2011) 107–194 © 2012 S. Shalev-Shwartz DOI: 10.1561/2200000018 Online Learning and Online Convex Optimization By Shai Shalev-Shwartz Contents
Sparse Online Learning via Truncated Gradient
Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahoo-inc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics
Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur
Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:
AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz
AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Presentation Outline: AdaBoost algorithm Why is of interest? How it works? Why
A Network Flow Approach in Cloud Computing
A Network Flow Approach in Cloud Computing Soheil Feizi, Amy Zhang, Muriel Médard RLE at MIT Abstract In this paper, by using network flow principles, we propose algorithms to address various challenges
Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore
Steven C.H. Hoi School of Computer Engineering Nanyang Technological University Singapore Acknowledgments: Peilin Zhao, Jialei Wang, Hao Xia, Jing Lu, Rong Jin, Pengcheng Wu, Dayong Wang, etc. 2 Agenda
Factoring & Primality
Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount
An Alternative Ranking Problem for Search Engines
An Alternative Ranking Problem for Search Engines Corinna Cortes 1, Mehryar Mohri 2,1, and Ashish Rastogi 2 1 Google Research, 76 Ninth Avenue, New York, NY 10011 2 Courant Institute of Mathematical Sciences,
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Boosting. riedmiller@informatik.uni-freiburg.de
Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de
Support Vector Machines
Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
Support Vector Machines for Classification and Regression
UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer
Fast Kernel Classifiers with Online and Active Learning
Journal of Machine Learning Research 6 (2005) 1579–1619 Submitted 3/05; Published 9/05 Fast Kernel Classifiers with Online and Active Learning Antoine Bordes NEC Laboratories America 4 Independence Way
CSC 411: Lecture 07: Multiclass Classification
CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel's lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass
Feed-Forward mapping networks KAIST Department of Bio and Brain Engineering 정재승
Feed-Forward mapping networks KAIST Department of Bio and Brain Engineering 정재승 How much energy do we need for brain functions? Information processing: Trade-off between energy consumption and wiring cost Trade-off between energy consumption
Distributed Machine Learning and Big Data
Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya