Introduction to Online Learning Theory
Wojciech Kotłowski, Institute of Computing Science, Poznań University of Technology
IDSS, 04.06.2013
Outline
1. Example: Online (Stochastic) Gradient Descent
2. Statistical learning theory
3. Online learning
4. Algorithms for prediction with expert advice
5. Algorithms for classification and regression
6. Conclusions
Text document classification [Bottou, 2007]
Data set: Reuters RCV1, a set of 800,000 documents from Reuters News published during 1996-1997; 781,265 training examples, 23,149 test examples; 47,152 TF-IDF features. Each document is assigned a set of topics (categories). For the sake of illustration, the problem is simplified to binary classification: predicting whether a document belongs to category CCAT (Corporate/Industrial).
Document example
<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="2330" id="root" date="1996-08-20" xml:lang="en">
<title>USA: Tylan stock jumps; weighs sale of company.</title>
<headline>Tylan stock jumps; weighs sale of company.</headline>
<dateline>SAN DIEGO</dateline>
<text>
<p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p>
<p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
<p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs & Co. has been retained as its financial adviser.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0"> <code code="usa"> </code> </codes>
<codes class="bip:industries:1.0"> <code code="i34420"> </code> </codes>
<codes class="bip:topics:1.0"> <code code="c15"> </code> <code code="c152"> </code> <code code="c18"> </code> <code code="c181"> </code> <code code="ccat"> </code> </codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="1996-08-20"/>
<dc element="dc.source" value="Reuters"/>
Experiment
Two loss functions: logistic loss (logistic regression) and hinge loss (SVM).
Two types of learning: standard (batch setting), i.e. minimization of the (regularized) empirical risk on the training data, and online gradient descent (stochastic gradient descent).
L2 regularization in both cases.
Source: http://leon.bottou.org/projects/sgd
Results
Hinge loss:
method             comp. time   training error   test error
SVMLight (batch)   23,642 sec   0.2275           6.02%
SVMPerf (batch)    66 sec       0.2278           6.03%
SGD                1.4 sec      0.2275           6.02%
Logistic loss:
method             comp. time   training error   test error
LibLinear (batch)  30 sec       0.18907          5.68%
SGD                2.3 sec      0.18893          5.66%
Some more examples (hinge loss)
data set      #objects    #features   time LIBSVM   time SGD
Reuters       781,000     47,000      2.5 days      7 sec
Translation   1,000,000   274,000     many days     7 sec
SuperTag      950,000     46,000      8 h           1 sec
Voicetone     579,000     88,000      10 h          1 sec
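The SGD column above corresponds to a very simple training loop. A minimal sketch on synthetic data (the learning-rate schedule, regularization constant, and data below are illustrative assumptions, not the settings used in the experiments):

```python
import random

def sgd_hinge(data, dim, lam=1e-4, epochs=1):
    """One-pass-at-a-time SGD on the L2-regularized hinge loss (linear SVM)."""
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:  # x: list of (index, value) pairs; y in {-1, +1}
            t += 1
            eta = 1.0 / (lam * t)            # a common 1/(lambda*t) schedule
            margin = y * sum(v * w[i] for i, v in x)
            for i in range(dim):             # shrink: gradient of the L2 term
                w[i] *= (1.0 - eta * lam)
            if margin < 1:                   # subgradient of the hinge term
                for i, v in x:
                    w[i] += eta * y * v
    return w

# Tiny synthetic problem: the label is the sign of the first feature.
random.seed(0)
data = []
for _ in range(500):
    x0, x1 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append(([(0, x0), (1, x1)], 1 if x0 > 0 else -1))

w = sgd_hinge(data, dim=2, epochs=5)
errors = sum(1 for x, y in data
             if y * sum(v * w[i] for i, v in x) <= 0)
print(w, errors / len(data))
```

Each example is touched once per epoch, which is why the per-iteration cost is independent of the data set size.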
A typical scheme in machine learning
[Diagram: a learning algorithm maps the training set S = {(x_1, y_1), ..., (x_4, y_4)} to a function (classifier) w_S : X -> Y. On the test set T, the classifier produces predictions ŷ_5 = w_S(x_5), ŷ_6 = w_S(x_6), ŷ_7 = w_S(x_7) for instances (x_5, ?), (x_6, ?), (x_7, ?); the feedback y_5, y_6, y_7 is used to measure accuracy via the losses l(y_5, ŷ_5), l(y_6, ŷ_6), l(y_7, ŷ_7).]
Test set performance
Ultimate question of machine learning theory: given a training set S = {(x_i, y_i)}_{i=1}^n, how do we learn a function w_S : X -> Y so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,
    sum_{i=n+1}^m l(y_i, w_S(x_i)),
is minimized?
No reasonable answer is possible without any assumptions ("no free lunch"): training data and test data must be in some sense similar.
Statistical learning theory
Assumption: training data and test data were independently generated from the same distribution P (i.i.d.).
Mean training error of w (empirical risk): L_S(w) = (1/n) sum_{i=1}^n l(y_i, w(x_i)).
Test error = expected error of w (risk): L(w) = E_{(x,y)~P}[l(y, w(x))].
Given training data S, how do we construct w_S to make L(w_S) as small as possible? => Empirical risk minimization.
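Empirical risk minimization over a finite class can be spelled out directly. A small sketch with a hypothetical finite class of threshold classifiers on a scalar feature (the class, sample, and threshold grid are mine, for illustration only):

```python
def empirical_risk(w, sample):
    """Mean 0/1 loss of classifier w on the sample."""
    return sum(1 for x, y in sample if w(x) != y) / len(sample)

def erm(classes, sample):
    """Pick the classifier in the finite class W minimizing empirical risk."""
    return min(classes, key=lambda w: empirical_risk(w, sample))

# Hypothetical finite class W: 21 threshold classifiers on one feature.
thresholds = [t / 10 for t in range(-10, 11)]
W = [(lambda x, t=t: 1 if x > t else 0) for t in thresholds]

# Small training sample with one noisy label (x = 0.1 labeled 1).
S = [(-0.5, 0), (0.0, 0), (0.2, 0), (0.4, 1), (0.7, 1), (0.9, 1), (0.1, 1)]
w_S = erm(W, S)
print(empirical_risk(w_S, S))
```

Because of the noisy point, even the best classifier in W makes one training error here (empirical risk 1/7).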
Generalization bounds
Theorem for finite classes (0/1 loss) [Occam's razor]: let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w in W} L_S(w), then:
    E_{S~P} L(w_S) <= min_{w in W} L(w) + O(sqrt(log|W| / n)).
The left-hand side is our test set performance; the first term on the right is the best test set performance in W, the second is the overhead for learning on a finite training set.
Generalization bounds
Theorem for VC classes (0/1 loss) [Vapnik-Chervonenkis, 1971]: let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w in W} L_S(w), then:
    E_{S~P} L(w_S) <= min_{w in W} L(w) + O(sqrt(d_VC / n)).
Again: our test set performance is the best test set performance in W plus an overhead for learning on a finite training set.
Some remarks on statistical learning theory
The bounds are asymptotically tight.
Typically d_VC ~ d, the number of parameters, so that L(w_S) = min_{w in W} L(w) + O(sqrt(d / n)).
Similar results (sometimes better) hold for losses other than 0/1.
The bounds hold for any distribution P. Improvements: data-dependent bounds.
The theory only tells you what happens for a given class W; choosing W is an art (domain knowledge).
Alternative view on learning
The stochastic (i.i.d.) assumption is often criticized, and sometimes clearly invalid (e.g., time series).
The learning process is by its very nature incremental.
We do not observe the distributions; we only see the data.
Motivation: can we relax the i.i.d. assumption and treat the data-generating process as completely arbitrary? Can we obtain performance guarantees based solely on observed quantities? We can! => online learning theory.
Example: weather prediction (rain/sunny)
On each day i = 1, 2, 3, 4, ... the learner announces a probability of rain (here: 50%, 25%, 10%, 25%, ...) and then observes the actual weather.
Example: weather prediction (rain/sunny), now with experts
          i = 1   i = 2   i = 3   i = 4   ...
expert 1  30%     50%     50%     10%     ...
expert 2  10%     80%     50%     10%     ...
expert 3  20%     70%     50%     30%     ...
expert 4  60%     30%     50%     80%     ...
learner   30%     65%     50%     10%     ...
Each day the experts announce their probabilities of rain, the learner combines them into its own prediction, and then the actual weather is revealed.
Example: weather prediction (rain/sunny)
Prediction accuracy is evaluated by a loss function, e.g. l(y_i, ŷ_i) = |y_i - ŷ_i|.
Total performance is evaluated by the regret: the learner's cumulative loss minus the cumulative loss of the best expert in hindsight.
          day 1   day 2   day 3   day 4   cumulative loss
expert 1  30%     50%     50%     10%     0.3 + 0.5 + 0.5 + 0.1 = 1.4
expert 2  10%     80%     50%     10%     0.1 + 0.2 + 0.5 + 0.1 = 0.9
expert 3  20%     70%     50%     30%     0.2 + 0.3 + 0.5 + 0.3 = 1.3
expert 4  60%     30%     50%     80%     0.6 + 0.7 + 0.5 + 0.8 = 2.6
learner   30%     65%     50%     10%     0.3 + 0.35 + 0.5 + 0.1 = 1.25
Regret of the learner: 1.25 - 0.9 = 0.35.
The goal is to have small regret for any data sequence.
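The cumulative losses can be recomputed directly from the predictions and the revealed outcomes. The outcomes below are inferred from the experts' losses (y = 0, 1, ., 0; the day-3 outcome is irrelevant because every prediction there is 50%); note that with the day-2 outcome y = 1, the learner's day-2 loss is |1 - 0.65| = 0.35, giving cumulative loss 1.25 and regret 0.35:

```python
# Expert predictions (probability of rain) over four days, plus the learner.
experts = {
    1: [0.30, 0.50, 0.50, 0.10],
    2: [0.10, 0.80, 0.50, 0.10],
    3: [0.20, 0.70, 0.50, 0.30],
    4: [0.60, 0.30, 0.50, 0.80],
}
learner = [0.30, 0.65, 0.50, 0.10]
# Outcomes consistent with the table; the day-3 value does not matter.
y = [0, 1, 0, 0]

def cumulative_loss(preds):
    """Cumulative absolute loss sum_i |y_i - yhat_i|."""
    return sum(abs(yi - p) for yi, p in zip(y, preds))

cum = {k: cumulative_loss(p) for k, p in experts.items()}
best = min(cum.values())                       # best expert in hindsight
regret = cumulative_loss(learner) - best
print(cum, cumulative_loss(learner), regret)
```

Expert 2 is the best in hindsight with cumulative loss 0.9.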
Online learning framework
[Diagram: in round i, the learner (strategy w_i : X -> Y) receives a new instance (x_i, ?), makes prediction ŷ_i = w_i(x_i), receives the feedback y_i, suffers loss l(y_i, ŷ_i), and moves on to round i + 1.]
Online learning framework
Set of strategies (actions) W; known loss function l. The learner starts with some initial strategy (action) w_1. For i = 1, 2, ...:
1. The learner observes instance x_i.
2. The learner predicts with ŷ_i = w_i(x_i).
3. The environment reveals the outcome y_i.
4. The learner suffers loss l_i(w_i) = l(y_i, ŷ_i).
5. The learner updates its strategy: w_i -> w_{i+1}.
Online learning framework
The goal of the learner is to be close to the best w in hindsight.
Cumulative loss of the learner: L̂_n = sum_{i=1}^n l_i(w_i).
Cumulative loss of the best strategy w in hindsight: L*_n = min_{w in W} sum_{i=1}^n l_i(w).
Regret of the learner: R_n = L̂_n - L*_n.
The goal is to minimize the regret over all possible data sequences.
Prediction with expert advice
N prediction strategies ("experts") to follow. Instance x = (x_1, ..., x_N): the vector of the experts' predictions. Strategy w = (w_1, ..., w_N): a probability vector (sum_{k=1}^N w_k = 1, w_k >= 0), interpreted as the learner's beliefs about the experts, or the probability of following a given expert.
Learner's prediction: ŷ = w(x) = w·x = sum_{k=1}^N w_k x_k.
Regret: competing with the best expert,
    R_n = L̂_n - L*_n,   where L*_n = min_{k=1,...,N} L_n(k)
and L_n(k) is the cumulative loss of the k-th expert.
Applications of the expert advice setting
Combining advice from actual experts :-)
Combining N learning algorithms/classifiers.
Gambling (e.g., horse racing).
Portfolio selection.
Routing/shortest path.
Spam filtering/document classification.
Learning boolean formulae.
Ranking/ordering.
...
Follow the leader
Follow the leader (FTL): at iteration i, choose the expert with the smallest loss on the past data:
    k_min = argmin_{k=1,...,N} sum_{j=1}^{i-1} l_j(k) = argmin_{k=1,...,N} L_{i-1}(k).
Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k != k_min.
Failure of FTL
          day 1   day 2   day 3   day 4   day 5   day 6   day 7   cumulative loss
expert 1  75%     100%    100%    100%    100%    100%    100%    3.75
expert 2  25%     0%      0%      0%      0%      0%      0%      3.25
FTL       50%     0%      100%    0%      100%    0%      100%    6.5
On this alternating sequence, FTL always follows yesterday's leader, who is always wrong today: L̂_n ≈ n while L*_n ≈ n/2, so R_n ≈ n/2.
The learner must hedge its bets on experts!
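The alternating sequence is easy to simulate. A sketch that reproduces the final cumulative losses of the table (the day-1 prediction of 50% is taken as the learner's default before any data is seen):

```python
# Two experts: after day 1, expert 1 always says 100%, expert 2 always 0%.
expert_preds = [
    [0.75] + [1.0] * 6,   # expert 1
    [0.25] + [0.0] * 6,   # expert 2
]
# Outcomes alternate so that yesterday's leader is always wrong today.
y = [0, 1, 0, 1, 0, 1, 0]

cum = [0.0, 0.0]          # experts' cumulative losses
ftl_loss = 0.0
for i in range(7):
    if i == 0:
        pred = 0.5        # no past data yet: default prediction
    else:
        leader = min(range(2), key=lambda k: cum[k])  # follow the leader
        pred = expert_preds[leader][i]
    ftl_loss += abs(y[i] - pred)
    for k in range(2):
        cum[k] += abs(y[i] - expert_preds[k][i])

print(cum, ftl_loss)  # -> [3.75, 3.25] 6.5
```

From day 2 on, FTL's loss is 1 every single day, while each expert is wrong only every other day.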
Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]
Algorithm: each time expert k receives a loss l(k), multiply the weight w_k associated with that expert by e^{-η l(k)}, where η > 0, and renormalize:
    w_{i+1,k} = w_{i,k} e^{-η l_i(k)} / Z_i,   where Z_i = sum_{k=1}^N w_{i,k} e^{-η l_i(k)}.
Unwinding this update:
    w_{i+1,k} = e^{-η L_i(k)} / Z_i,   where L_i(k) = sum_{j<=i} l_j(k) and Z_i = sum_{k=1}^N e^{-η L_i(k)}.
Hedge as Bayesian updates
Prior probability over N alternatives E_1, ..., E_N; data likelihoods P(D_i | E_k), k = 1, ..., N:
    P(E_k | D_i) = P(D_i | E_k) P(E_k) / sum_{j=1}^N P(D_i | E_j) P(E_j).
The correspondence: posterior probability <-> w_{i+1,k}; data likelihood <-> e^{-η l_i(k)}; prior probability <-> w_{i,k}; normalization <-> Z_i.
Hedge example (η = 2)
Four experts, uniform initial weights (0.25 each); the bar charts on the slides show the weight vector shifting toward the better experts after each round.
Round 1: expert predictions 30%, 10%, 20%, 60%; learner predicts 30%; losses 0.3, 0.1, 0.2, 0.6 (learner: 0.3).
Round 2: predictions 50%, 80%, 70%, 30%; learner predicts 64%; losses 0.5, 0.2, 0.3, 0.7 (learner: 0.36).
Round 3: predictions 50%, 50%, 50%, 50%; learner predicts 50%; losses all 0.5.
Round 4: predictions 10%, 10%, 30%, 80%; learner predicts 21%; losses 0.1, 0.1, 0.3, 0.8 (learner: 0.21).
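The worked example can be replayed in a few lines. With η = 2 and uniform initial weights, the updates reproduce the learner's predictions 30%, 64%, 50%, 21% (the outcomes are inferred from the losses shown: y = 0, 1, ., 0):

```python
import math

expert_preds = [  # one row per round, one column per expert
    [0.3, 0.1, 0.2, 0.6],
    [0.5, 0.8, 0.7, 0.3],
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.1, 0.3, 0.8],
]
y = [0, 1, 0, 0]   # round-3 outcome is irrelevant: all experts say 50%
eta = 2.0

w = [0.25] * 4                       # uniform prior over the experts
learner = []
for preds, yi in zip(expert_preds, y):
    learner.append(sum(wk * p for wk, p in zip(w, preds)))  # ŷ = w·x
    losses = [abs(yi - p) for p in preds]
    w = [wk * math.exp(-eta * l) for wk, l in zip(w, losses)]
    z = sum(w)
    w = [wk / z for wk in w]         # renormalize (Hedge update)

print([round(p, 2) for p in learner])  # -> [0.3, 0.64, 0.5, 0.21]
```

By round 4 the weights concentrate on experts 1 and 2, so the learner's prediction tracks theirs.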
Hedge analysis
Regret bound: for any data sequence, when η = sqrt(8 log N / n),
    R_n <= sqrt(n log N / 2).
Regret bound (loss-dependent): when η = sqrt(2 log N / L*_n),
    R_n <= sqrt(2 L*_n log N) + log N.
Both bounds are tight.
Statistical learning theory vs. online learning theory
Statistical learning theory: if W is finite, then for a classifier w_S trained by empirical risk minimization,
    E_{S~P} L(w_S) - min_{w in W} L(w) = O(sqrt(log|W| / n)).
Online learning theory: in the expert setting (|W| = N), for the Hedge algorithm,
    (1/n) R_n = (1/n) L̂_n - (1/n) L*_n = O(sqrt(log|W| / n)).
Essentially the same performance without the i.i.d. assumption, but at the price of averaging over rounds!
Prediction with expert advice: extensions
Large (or countably infinite) classes of experts.
Concept drift: competing with the best sequence of experts.
Competing with the best small set of recurring experts.
Ranking: competing with the best permutation.
Partial feedback: multi-armed bandits.
...
Concept drift
Competing with the best sequence of m experts. Idea: treat each sequence of m experts as a new expert and run Hedge on top. The resulting update mixes the standard Hedge weights with the initial (uniform) distribution:
    w_{i+1,k} = α (1/N) + (1 - α) w_{i,k} e^{-η l_i(k)} / sum_{j=1}^N w_{i,j} e^{-η l_i(j)}.
Parameter α ≈ m/n (the frequency of changes). Bayesian interpretation: prior and posterior over sequences of experts.
Competing with the best small set of m recurring experts: mix over all past Hedge weights instead of the initial distribution.
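The mixing update above can be sketched in a few lines (the function name is mine; α and η are tuning parameters):

```python
import math

def fixed_share_update(w, losses, eta, alpha):
    """Hedge step followed by mixing with the uniform distribution."""
    n = len(w)
    hedge = [wk * math.exp(-eta * l) for wk, l in zip(w, losses)]
    z = sum(hedge)
    return [alpha / n + (1 - alpha) * h / z for h in hedge]

w = [0.25] * 4
w = fixed_share_update(w, [0.9, 0.0, 0.9, 0.9], eta=2.0, alpha=0.1)
print(w)  # mass moves to expert 2, but no weight drops below alpha/4
```

The uniform-mixing term keeps every weight at least α/N, so the algorithm can recover quickly when the best expert changes.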
Classification/regression framework
Instances x: feature vectors. Outcome y: class label / real output.
Strategy w in W: a parameter vector w = (w_1, ..., w_d) in R^d. W can be R^d or a regularization ball W = {w : ||w||_p <= B}.
The prediction is linear: ŷ = w·x.
The loss l depends on the task we want to solve, but we assume l(y, ŷ) is convex in ŷ.
Examples
Linear regression: y in R and l(y, ŷ) = (y - ŷ)^2.
Logistic regression: y in {-1, 1} and l(y, ŷ) = log(1 + e^{-yŷ}).
Support vector machines: y in {-1, 1} and l(y, ŷ) = (1 - yŷ)_+.
Examples
[Plot: hinge (SVM), squared, and logistic losses as functions of the prediction. Logistic and hinge losses are plotted for y = 1; the squared error loss is plotted for y = 0.]
Gradient descent method
Minimize a function f(w) over w in R^d. Gradient descent:
    w_{i+1} = w_i - η_i ∇_w f(w_i),
where η_i is a step size. If we have a set of constraints w in W, after each step we need to project back onto W:
    w_{i+1} := argmin_{w in W} ||w_{i+1} - w||_2.
Source: http://www-bcf.usc.edu/~larry/, http://takisword.files.wordpress.com
Online (stochastic) gradient descent
Algorithm: start with any initial vector w_1 in W. For i = 1, 2, ...:
1. Observe the input vector x_i.
2. Predict with ŷ_i = w_i·x_i.
3. The outcome y_i is revealed.
4. Suffer loss l_i(w_i) = l(y_i, ŷ_i).
5. Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i - η_i ∇_w l_i(w_i).
6. If w_{i+1} is not in W, project it back onto W: w_{i+1} := argmin_{w in W} ||w_{i+1} - w||_2.
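The protocol above, sketched for squared loss with projection onto an L2 ball (the step-size schedule and the synthetic stream are illustrative assumptions):

```python
import math
import random

def ogd(stream, dim, B, eta0=0.5):
    """Online gradient descent for linear regression, W = {w : ||w||_2 <= B}."""
    w = [0.0] * dim
    total_loss = 0.0
    for i, (x, y) in enumerate(stream, start=1):
        y_hat = sum(wk * xk for wk, xk in zip(w, x))        # 2. predict
        total_loss += (y - y_hat) ** 2                      # 4. suffer loss
        eta = eta0 / math.sqrt(i)                           # decaying step size
        grad = [-2 * (y - y_hat) * xk for xk in x]          # gradient of (y - ŷ)²
        w = [wk - eta * g for wk, g in zip(w, grad)]        # 5. gradient step
        norm = math.sqrt(sum(wk * wk for wk in w))
        if norm > B:                                        # 6. project onto ball
            w = [B * wk / norm for wk in w]
    return w, total_loss

# Noiseless synthetic stream: y = target·x.
random.seed(1)
target = [0.7, -0.2]
stream = []
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in range(2)]
    stream.append((x, sum(t * xk for t, xk in zip(target, x))))

w, loss = ogd(stream, dim=2, B=1.0)
print(w)
```

Each example is processed once and then discarded: O(1) work per round, as opposed to O(n) for a batch gradient step.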
Online vs. standard gradient descent
The function we want to minimize: f(w) = sum_{j=1}^n l_j(w).
Standard (batch) GD: w_{i+1} := w_i - η_i ∇_w f(w_i) = w_i - η_i sum_j ∇_w l_j(w_i); O(n) per iteration, needs to see all the data.
Online GD: w_{i+1} := w_i - η_i ∇_w l_i(w_i); O(1) per iteration, needs to see a single data point.
Online (stochastic) gradient descent
[Figure: successive iterates w_1 -> w_2 -> w_3 -> w_4 inside the feasible region W; each step moves along the negative gradient -η ∇ l_i(w_i), and when a step leaves W the iterate is projected back onto W.]
Calculating the gradient
The gradient ∇_w l_i(w) can be obtained by applying the chain rule:
    ∂ l_i(w) / ∂ w_k = ∂ l(y_i, w·x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w·x_i} · ∂ (w·x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w·x_i} · x_{ik}.
If we denote l'_i(w) := [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w·x_i}, then we can write:
    ∇_w l_i(w_i) = l'_i(w_i) x_i.
Update rules for specific losses
Linear regression: l(y, ŷ) = (y - ŷ)^2, so l'(ŷ) = -2(y - ŷ). Update: w_{i+1} = w_i + 2η_i (y_i - ŷ_i) x_i.
Logistic regression: l(y, ŷ) = log(1 + e^{-yŷ}), so l'(ŷ) = -y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: l(y, ŷ) = (1 - yŷ)_+, so l'(ŷ) = 0 if yŷ > 1 and -y if yŷ <= 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i <= 1] y_i x_i (the perceptron!).
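The three derivative formulas translate directly into code (a sketch; l'(ŷ) below denotes the derivative of the loss in its second argument):

```python
import math

def dloss_squared(y, y_hat):
    """l'(ŷ) for squared loss: -2(y - ŷ)."""
    return -2 * (y - y_hat)

def dloss_logistic(y, y_hat):
    """l'(ŷ) for logistic loss, y in {-1, +1}: -y / (1 + e^{yŷ})."""
    return -y / (1 + math.exp(y * y_hat))

def dloss_hinge(y, y_hat):
    """Subgradient of the hinge loss, y in {-1, +1}."""
    return -y if y * y_hat <= 1 else 0.0

def ogd_step(w, x, y, eta, dloss):
    """w_{i+1} = w_i - eta * l'(ŷ) * x  (from the chain rule)."""
    y_hat = sum(wk * xk for wk, xk in zip(w, x))
    g = dloss(y, y_hat)
    return [wk - eta * g * xk for wk, xk in zip(w, x)]

w = [0.0, 0.0]
# Hinge loss on a point inside the margin: the perceptron-style w += eta*y*x.
w = ogd_step(w, [1.0, 2.0], +1, eta=0.5, dloss=dloss_hinge)
print(w)  # -> [0.5, 1.0]
```

When the margin yŷ exceeds 1, the hinge subgradient is zero and the weight vector is left unchanged, exactly as in the perceptron-style rule.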
Projection
    w_i := argmin_{w in W} ||w_i - w||_2.
When W = R^d: no projection step.
When W = {w : ||w|| <= B} is an L2 ball, projection corresponds to a renormalization of the weight vector: if ||w_i|| > B, then w_i := B w_i / ||w_i||. Equivalent to L2 regularization.
When W = {w : sum_{k=1}^d |w_k| <= B} is an L1 ball, projection corresponds to an additive shift of the absolute values, clipping the smaller weights to 0. Equivalent to L1 regularization; results in sparse solutions.
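Both projections can be sketched concretely. The L2 case is a rescaling; for the L1 case, one standard approach (in the spirit of Duchi et al.'s simplex-projection method; the implementation below is a sketch of that idea) finds the common threshold by which the absolute values are shifted down before clipping at zero:

```python
import math

def project_l2(v, B):
    """Projection onto {w : ||w||_2 <= B}: renormalize if outside the ball."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm <= B else [B * x / norm for x in v]

def project_l1(v, B):
    """Projection onto {w : sum |w_k| <= B}: shift |v| down, clip at 0."""
    if sum(abs(x) for x in v) <= B:
        return list(v)
    u = sorted((abs(x) for x in v), reverse=True)
    cum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cum += uj
        if uj - (cum - B) / j > 0:
            theta = (cum - B) / j        # largest feasible threshold
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]

print(project_l2([3.0, 4.0], 1.0))   # -> [0.6, 0.8]
print(project_l1([3.0, -1.0], 2.0))  # smaller coordinate is clipped to 0
```

Note how the L1 projection zeroes out the smaller coordinate entirely, which is the source of the sparsity mentioned above.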
L1 vs. L2 projection
[Figure: projecting the same point w_i onto an L2 ball and onto an L1 ball. The L2 projection shrinks both coordinates; the L1 projection lands on a face of the ball and clips the smaller coordinate to 0 (sparsity).]
Convergence of online gradient descent
Theorem: let 0 in W, assume ||∇_w l_i(w)|| <= L, and let W = max_{w in W} ||w||. Then, with w_1 = 0 and η_i = W / (L sqrt(i)), the regret is bounded by
    R_n <= W L sqrt(2n),
so that the per-iteration regret satisfies (1/n) R_n <= W L sqrt(2/n).
Exponentiated Gradient [Kivinen & Warmuth, 1996]
Gradient descent: w_{i+1} := w_i - η_i ∇_w l_i(w_i).
Exponentiated gradient: w_{i+1} := (1/Z_i) w_i e^{-η_i ∇_w l_i(w_i)} (componentwise).
A direct extension of Hedge to the classification/regression framework. Requires positive weights, but can be applied in the general setting using the doubling trick. Works much better than online gradient descent when d is very large (many features) and only a small number of features are relevant. Works very well when the best model is sparse, but does not produce a sparse solution itself!
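A sketch of the componentwise update for the positive-weight, normalized case (the gradient values fed in below are illustrative):

```python
import math

def eg_step(w, grad, eta):
    """Exponentiated gradient: multiply by exp(-eta * gradient), renormalize."""
    w = [wk * math.exp(-eta * g) for wk, g in zip(w, grad)]
    z = sum(w)
    return [wk / z for wk in w]

d = 4
w = [1.0 / d] * d                         # uniform start, as in the EG theorem
# A gradient favoring coordinate 0 (illustrative values).
w = eg_step(w, [-1.0, 0.5, 0.5, 0.5], eta=0.5)
print(w)
```

The multiplicative update keeps every weight strictly positive and the vector normalized, which is exactly why the iterates themselves are never sparse.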
Convergence of exponentiated gradient
Theorem: let W be the positive orthant of the L1 ball with radius W, and assume ||∇_w l_i(w)|| <= L. Then, with w_1 = (1/d, ..., 1/d) and η_i = (1/(W L)) sqrt(8 log d / i), the regret is bounded by
    R_n <= W L sqrt(n log d / 2),
so that the per-iteration regret satisfies (1/n) R_n <= W L sqrt(log d / (2n)).
Extensions
Concept drift: competing with drifting parameter vectors.
Partial feedback: contextual multi-armed bandit problems.
Improvements for some (strongly convex, exp-concave) loss functions.
Infinite-dimensional feature spaces via the kernel trick.
Learning matrix parameters (matrix-norm regularization, positive definiteness, permutation matrices).
...
Conclusions
A theoretical framework for learning without the i.i.d. assumption.
Performance bounds are often simpler to prove than in the stochastic setting.
Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc.
Results in online algorithms directly applicable to large-scale learning problems.
Many currently used offline learning algorithms employ online learning as an optimization routine.