Introduction to Online Learning Theory


1 Introduction to Online Learning Theory. Wojciech Kotłowski, Institute of Computing Science, Poznań University of Technology. IDSS, 1 / 53

2 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 2 / 53


4 Text document classification [Bottou, 2007] Data set: Reuters RCV1, a set of news documents published by Reuters News, split into training examples and testing examples, with TF-IDF features. Each document was assigned some topics (categories). For the sake of illustration, the problem is simplified to binary classification: predict whether the document belongs to category CCAT (Corporate/Industrial). 4 / 53

5 Document example
<?xml version="1.0" encoding="iso "?>
<newsitem itemid="2330" id="root" date=" " xml:lang="en">
<title>USA: Tylan stock jumps; weighs sale of company.</title>
<headline>Tylan stock jumps; weighs sale of company.</headline>
<dateline>SAN DIEGO</dateline>
<text>
<p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p>
<p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
<p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs & Co. has been retained as its financial adviser.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0"> <code code="USA"> </code> </codes>
<codes class="bip:industries:1.0"> <code code="I34420"> </code> </codes>
<codes class="bip:topics:1.0"> <code code="C15"> </code> <code code="C152"> </code> <code code="C18"> </code> <code code="C181"> </code> <code code="CCAT"> </code> </codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value=" "/>
<dc element="dc.source" value="Reuters"/>
5 / 53

6 Experiment
Two types of loss functions: logistic loss (logistic regression) and hinge loss (SVM).
Two types of learning: standard (batch setting), i.e. minimization of the (regularized) empirical risk on the training data, and online gradient descent (stochastic gradient descent).
L2 regularization.
Source: 6 / 53

7 Results
Hinge loss:
  method             comp. time    training error    test error
  SVMLight (batch)        sec             %                %
  SVMPerf (batch)      66 sec             %                %
  SGD                 1.4 sec             %                %
Logistic loss:
  method             comp. time    training error    test error
  LibLinear (batch)    30 sec             %                %
  SGD                 2.3 sec             %                %
7 / 53

8 Some more examples (hinge loss)
  Data set       #objects    #features    time LIBSVM    time SGD
  Reuters                                   days           7 sec
  Translation                               many days      7 sec
  SuperTag                                  h              1 sec
  Voicetone                                 h              1 sec
8 / 53

9 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 9 / 53

10 A typical scheme in machine learning
A training set S = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)} is fed to a learning algorithm, which outputs a function (classifier) w_S : X → Y. The classifier is applied to a test set T of new instances (x_5, ?), (x_6, ?), (x_7, ?), giving predictions ŷ_5 = w_S(x_5), ŷ_6 = w_S(x_6), ŷ_7 = w_S(x_7); the feedback y_5, y_6, y_7 is then used to measure accuracy via the losses l(y_5, ŷ_5), l(y_6, ŷ_6), l(y_7, ŷ_7). 10 / 53

11-12 Test set performance
Ultimate question of machine learning theory: given a training set S = {(x_i, y_i)}_{i=1}^n, how to learn a function w_S : X → Y so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,
  ∑_{i=n+1}^m l(y_i, w_S(x_i)),
is minimized?
No reasonable answer without any assumptions ("no free lunch"). Training data and test data must be in some sense similar. 11 / 53

13-14 Statistical learning theory
Assumption: training data and test data were independently generated from the same distribution P (i.i.d.).
Mean training error of w (empirical risk): L_S(w) = (1/n) ∑_{i=1}^n l(y_i, w(x_i)).
Test error = expected error of w (risk): L(w) = E_{(x,y)~P}[l(y, w(x))].
Given training data S, how to construct w_S to make L(w_S) as small as possible? ⇒ Empirical risk minimization. 12 / 53
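
To make empirical risk minimization over a finite class concrete, here is a minimal Python sketch (not from the slides; the finite class of threshold classifiers, the toy data, and all names are illustrative assumptions):

    # A minimal ERM sketch: pick, from a finite class W, the classifier with
    # the smallest empirical risk L_S(w) under the 0/1 loss.
    import numpy as np

    def zero_one_loss(y, y_hat):
        return float(y != y_hat)

    def empirical_risk(classifier, S):
        return np.mean([zero_one_loss(y, classifier(x)) for x, y in S])

    # Finite class W: threshold classifiers on a 1-d feature (assumed example).
    thresholds = np.linspace(-1.0, 1.0, 21)
    W = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

    # Toy i.i.d. training sample S = {(x_i, y_i)} (assumed data).
    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=100)
    ys = (xs + 0.1 * rng.normal(size=100) >= 0.2).astype(int)
    S = list(zip(xs, ys))

    # ERM: w_S = arg min over W of the empirical risk.
    w_S = min(W, key=lambda w: empirical_risk(w, S))
    print("empirical risk of w_S:", empirical_risk(w_S, S))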

15-16 Generalization bounds
Theorem for finite classes (0/1 loss) [Occam's Razor]: let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w∈W} L_S(w), then:
  E_{S~P} L(w_S) = min_{w∈W} L(w) + O(√(log|W| / n)).
Here E_{S~P} L(w_S) is our test set performance, min_{w∈W} L(w) is the best test set performance in W, and the O(√(log|W| / n)) term is the overhead for learning on a finite training set. 13 / 53

17-18 Generalization bounds
Theorem for VC classes (0/1 loss) [Vapnik-Chervonenkis, 1971]: let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w∈W} L_S(w), then:
  E_{S~P} L(w_S) = min_{w∈W} L(w) + O(√(d_VC / n)).
As before, the left-hand side is our test set performance, the min term is the best test set performance in W, and the O term is the overhead for learning on a finite training set. 14 / 53

19 Some remarks on statistical learning theory
The bounds are asymptotically tight. Typically d_VC ≈ d, the number of parameters, so that:
  L(w_S) = min_{w∈W} L(w) + O(√(d / n)).
Similar results (sometimes better) for losses other than 0/1. Holds for any distribution P. Improvements: data-dependent bounds. The theory only tells you what happens for a given class W; choosing W is an art (domain knowledge). 15 / 53

20 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 16 / 53

21-23 Alternative view on learning
The stochastic (i.i.d.) assumption is often criticized, and sometimes clearly invalid (e.g. time series). The learning process is by its very nature incremental. We do not observe the distributions, we only see the data.
Motivation: can we relax the i.i.d. assumption and treat the data generating process as completely arbitrary? Can we obtain performance guarantees based solely on observed quantities?
We can! ⇒ online learning theory. 17 / 53

24-51 Example: weather prediction (rain/sunny)
The slides build the example incrementally over days i = 1, 2, 3, 4; the actual weather on each day is shown as rain/sun icons in the original slides.
First, a single forecaster announces a probability of rain each day: 75%, 25%, 10%, 25%.
Then the forecaster is given the predictions of four experts and combines them into its own forecast:

  expert     i = 1   i = 2   i = 3   i = 4
  1           30%     50%     50%     10%
  2           10%     80%     50%     10%
  3           20%     70%     50%     30%
  4           60%     30%     50%     80%
  learner     30%     65%     50%     10%

Prediction accuracy is evaluated by a loss function, e.g. l(y_i, ŷ_i) = |y_i − ŷ_i|.
Total performance is evaluated by a regret: the learner's cumulative loss minus the cumulative loss of the best expert in hindsight. In the example, the cumulative loss of each expert and of the learner is computed (the learner's comes to 1.35), and the learner's regret is the difference between its cumulative loss and that of the best expert.
The goal is to have small regret for any data sequence. 18-20 / 53

52 Online learning framework (diagram)
At round i the learner, with current strategy w_i : X → Y, receives a new instance (x_i, ?), makes a prediction ŷ_i = w_i(x_i), gets the feedback y_i, suffers the loss l(y_i, ŷ_i), and moves on to round i + 1. 21 / 53

53 Online learning framework
Set of strategies (actions) W; known loss function l. Learner starts with some initial strategy (action) w_1. For i = 1, 2, ...:
1. Learner observes instance x_i.
2. Learner predicts with ŷ_i = w_i(x_i).
3. The environment reveals outcome y_i.
4. Learner suffers loss l_i(w_i) = l(y_i, ŷ_i).
5. Learner updates its strategy w_i → w_{i+1}.
22 / 53

54 Online learning framework
The goal of the learner is to be close to the best w in hindsight.
Cumulative loss of the learner: ˆL_n = ∑_{i=1}^n l_i(w_i).
Cumulative loss of the best strategy w in hindsight: L*_n = min_{w∈W} ∑_{i=1}^n l_i(w).
Regret of the learner: R_n = ˆL_n − L*_n.
The goal is to minimize regret over all possible data sequences. 23 / 53
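
As a concrete rendering of the protocol and the regret, here is a minimal Python sketch in the expert setting (not from the slides; the harness, its names, and the random losses are illustrative assumptions):

    # Run an arbitrary weighting strategy against a sequence of expert losses
    # and report cumulative loss, best expert in hindsight, and regret.
    import numpy as np

    def run_online(expert_losses, choose_weights):
        """expert_losses: (n, N) array, loss of each of N experts per round.
        choose_weights(past_cumulative_losses) -> probability vector over experts."""
        n, N = expert_losses.shape
        cum = np.zeros(N)                        # cumulative loss of each expert
        learner_loss = 0.0
        for i in range(n):
            w = choose_weights(cum)              # learner's beliefs before round i
            learner_loss += w @ expert_losses[i] # expected loss of following w
            cum += expert_losses[i]              # feedback: experts' losses revealed
        best_expert = cum.min()
        return learner_loss, best_expert, learner_loss - best_expert  # regret R_n

    # Example: uniform averaging of N = 3 experts on random 0/1 losses (toy data).
    rng = np.random.default_rng(1)
    losses = rng.integers(0, 2, size=(100, 3)).astype(float)
    print(run_online(losses, lambda cum: np.ones(len(cum)) / len(cum)))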

55 Prediction with expert advice
N prediction strategies ("experts") to follow. Instance x = (x_1, ..., x_N): vector of experts' predictions. Strategy w = (w_1, ..., w_N): probability vector (∑_{k=1}^N w_k = 1, w_k ≥ 0), the learner's beliefs about the experts, or the probability of following a given expert.
Learner's prediction: ŷ = w(x) = w · x = ∑_{k=1}^N w_k x_k.
Regret: competing with the best expert, R_n = ˆL_n − L*_n, where L*_n = min_{k=1,...,N} L_n(k) and L_n(k) is the loss of the kth expert. 24 / 53

56 Applications of the experts advice setting
Combining advice from actual experts :-) Combining N learning algorithms/classifiers. Gambling (e.g., horse racing). Portfolio selection. Routing/shortest path. Spam filtering/document classification. Learning boolean formulae. Ranking/ordering... 25 / 53

57 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 26 / 53

58 Follow the leader
Follow the leader (FTL): at iteration i, choose the expert which has the smallest loss on the past data:
  k_min = arg min_{k=1,...,N} ∑_{j=1}^{i-1} l_j(k) = arg min_{k=1,...,N} L_{i-1}(k).
Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min. 27 / 53
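
A minimal Python sketch of this rule (not from the slides; names are illustrative):

    # FTL in the expert setting: put all weight on the expert with the
    # smallest cumulative loss so far.
    import numpy as np

    def ftl_weights(cum_losses):
        """cum_losses: L_{i-1}(k) for each expert k; returns the weight vector."""
        w = np.zeros(len(cum_losses))
        w[np.argmin(cum_losses)] = 1.0   # w_{k_min} = 1, all other weights 0
        return w

    # Example: after cumulative losses (2.0, 1.5, 3.0), FTL follows expert 2.
    print(ftl_weights(np.array([2.0, 1.5, 3.0])))   # [0. 1. 0.]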

59-82 Failure of FTL
The slides step through a small example in which the per-round losses of the experts are chosen so that the leader changes from round to round. FTL keeps switching to yesterday's best expert and suffers loss on almost every round, while the best fixed expert suffers loss on only about half of the rounds. Hence
  ˆL_n ≈ n,   L*_n ≈ n/2,   R_n ≈ n/2,
i.e. the regret grows linearly with n.
Learner must hedge its bets on experts! 28 / 53

83-85 Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]
Algorithm: each time expert k receives a loss l(k), multiply the weight w_k associated with that expert by e^{-η l(k)}, where η > 0:
  w_{i+1,k} = w_{i,k} e^{-η l_i(k)} / Z_i,   where Z_i = ∑_{k=1}^N w_{i,k} e^{-η l_i(k)}.
Unwinding this update:
  w_{i+1,k} = e^{-η L_i(k)} / Z_i,   where L_i(k) = ∑_{j≤i} l_j(k) and Z_i = ∑_{k=1}^N e^{-η L_i(k)}.
29 / 53
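
A minimal Python sketch of one Hedge step (not from the slides; names and the example values are illustrative):

    # Hedge update: multiply each expert's weight by exp(-eta * loss) and
    # renormalize by Z_i.
    import numpy as np

    def hedge_update(w, losses, eta):
        """w: current weight vector; losses: l_i(k) for each expert k."""
        w_new = w * np.exp(-eta * losses)
        return w_new / w_new.sum()        # divide by Z_i

    # Example: 4 experts, uniform prior, one round of losses, eta = 2.
    w = np.full(4, 0.25)
    print(hedge_update(w, np.array([0.3, 0.1, 0.2, 0.6]), eta=2.0))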

86-87 Hedge as Bayesian updates
Prior probability over N alternatives E_1, ..., E_N; data likelihoods P(D_i | E_k), k = 1, ..., N:
  P(E_k | D_i) = P(D_i | E_k) P(E_k) / ∑_{j=1}^N P(D_i | E_j) P(E_j).
The Hedge update has the same form: the posterior probability corresponds to w_{i+1,k}, the data likelihood to e^{-η l_i(k)}, the prior probability to w_{i,k}, and the normalization to Z_i. 30 / 53

88-99 Hedge example (η = 2)
The slides replay the weather example with Hedge: after each day the weight of every expert is multiplied by e^{-η l(k)} and renormalized, and the learner predicts with the weighted average of the experts' forecasts. On the first day, with uniform weights, the experts' predictions 30%, 10%, 20%, 60% combine to 30%; later, the predictions 50%, 80%, 70%, 30% combine to 64%, the predictions 50%, 50%, 50%, 50% to 50%, and the predictions 10%, 10%, 30%, 80% to 21%. Experts with small cumulative loss end up dominating the weight distribution (shown as bar charts in the original slides). 31 / 53

100 Hedge analysis
Regret bound: for any data sequence, when η = √(8 log N / n),
  R_n ≤ √(n log N / 2).
Regret bound: for any data sequence, when η = √(2 log N / L*_n),
  R_n ≤ √(2 L*_n log N) + log N.
Both bounds are tight. 32 / 53
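
A quick numeric illustration of the first bound (the values of N and n are assumed for illustration, not from the slides):

    # Tune eta as in the first bound and evaluate the resulting regret bound.
    import math

    N, n = 4, 1000                                  # illustrative numbers
    eta = math.sqrt(8 * math.log(N) / n)            # eta = sqrt(8 log N / n)
    bound = math.sqrt(n * math.log(N) / 2)          # R_n <= sqrt(n log N / 2)
    print(f"eta = {eta:.3f}, regret bound = {bound:.1f}, per round = {bound / n:.4f}")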

101-102 Statistical learning theory vs. online learning theory
Statistical learning theory. Theorem: if W is finite, then for a classifier w_S trained by empirical risk minimization,
  E_{S~P} L(w_S) − min_{w∈W} L(w) = O(√(log|W| / n)).
Online learning theory. Theorem: in the expert setting (|W| = N), for the Hedge algorithm,
  (1/n) R_n = (1/n) ˆL_n − (1/n) L*_n = O(√(log|W| / n)).
Essentially the same performance without the i.i.d. assumption, but at the price of averaging! 33 / 53

103 Prediction with expert advice: Extensions Large (or countably infinite) class of experts. Concept drift: competing with the best sequence of experts. Competing with the best small set of recurring experts. Ranking: competing with the best permutation. Partial feedback: multi-armed bandits / 53

104-105 Concept drift
Competing with the best sequence of m experts. Idea: treat each sequence of m experts as a new expert and run Hedge on top:
  w_{i+1,k} = α (1/N) + (1 − α) · w_{i,k} e^{-η l_i(k)} / ∑_{j=1}^N w_{i,j} e^{-η l_i(j)},
a mixture of the standard Hedge weights and the initial distribution. Parameter α = m/n (frequency of changes). Bayesian interpretation: prior and posterior over sequences of experts.
Competing with the best small set of m recurring experts: mixture over all past Hedge weights. 35 / 53
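
A minimal Python sketch of this mixture update (not from the slides; names and example values are illustrative assumptions):

    # Fixed-share style update for concept drift: mix the Hedge posterior with
    # the uniform initial distribution, using alpha as the mixing parameter.
    import numpy as np

    def fixed_share_update(w, losses, eta, alpha):
        hedge = w * np.exp(-eta * losses)
        hedge /= hedge.sum()                      # standard Hedge weights
        N = len(w)
        return alpha / N + (1 - alpha) * hedge    # mix with the uniform prior

    # Example: alpha plays the role of m/n, the assumed frequency of changes.
    w = np.full(4, 0.25)
    print(fixed_share_update(w, np.array([0.3, 0.1, 0.2, 0.6]), eta=2.0, alpha=0.05))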

106 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 36 / 53

107 Classification/regression framework
Instances x: feature vectors. Outcome y: class/real output. Strategy w ∈ W: parameter vector w = (w_1, ..., w_d) ∈ R^d. W can be R^d or a regularization ball W = {w : ‖w‖_p ≤ B}. Prediction is linear: ŷ = w · x. The loss l depends on the task we want to solve, but we assume l(y, ŷ) is convex in ŷ. 37 / 53

108 Examples
Linear regression: y ∈ R and l(y, ŷ) = (y − ŷ)².
Logistic regression: y ∈ {−1, +1} and l(y, ŷ) = log(1 + e^{−yŷ}).
Support vector machines: y ∈ {−1, +1} and l(y, ŷ) = (1 − yŷ)_+.
38 / 53

109 Examples (plot)
Plot of the three losses as functions of the prediction ŷ: hinge (SVM), square, and logistic. Logistic and hinge losses plotted for y = 1; the squared error loss is also shown. 39 / 53

110-111 Gradient descent method
Minimize a function f(w) over w ∈ R^d. Gradient descent method:
  w_{i+1} = w_i − η_i ∇_{w_i} f(w_i),
where η_i is a step size. If we have a set of constraints w ∈ W, after each step we need to project back onto W:
  w_{i+1} := arg min_{w∈W} ‖w_{i+1} − w‖_2.
40 / 53

112 Online (stochastic) gradient descent
Algorithm: start with any initial vector w_1 ∈ W. For i = 1, 2, ...:
1. Observe input vector x_i.
2. Predict with ŷ_i = w_i · x_i.
3. Outcome y_i is revealed.
4. Suffer loss l_i(w_i) = l(y_i, ŷ_i).
5. Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i).
6. If w_{i+1} ∉ W, project it back onto W: w_{i+1} := arg min_{w∈W} ‖w_{i+1} − w‖_2.
41 / 53
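
A minimal Python sketch of this loop for squared error loss with projection onto an L2-ball (not from the slides; the step-size schedule, radius B, and the synthetic data are assumptions):

    # Online gradient descent for a linear model with squared error loss,
    # projected onto W = {w : ||w||_2 <= B}.
    import numpy as np

    def ogd(stream, d, eta0=0.1, B=10.0):
        """stream: iterable of (x_i, y_i) pairs with x_i in R^d."""
        w = np.zeros(d)
        for i, (x, y) in enumerate(stream, start=1):
            y_hat = w @ x                        # predict
            grad = 2 * (y_hat - y) * x           # gradient of (y - y_hat)^2 w.r.t. w
            w = w - (eta0 / np.sqrt(i)) * grad   # gradient step, decaying step size
            norm = np.linalg.norm(w)
            if norm > B:                         # project back onto the L2-ball
                w = B * w / norm
        return w

    # Example on synthetic data generated from a fixed linear model (toy data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
    print(ogd(zip(X, y), d=3))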

113 Online vs. standard gradient descent
The function we want to minimize: f(w) = ∑_{j=1}^n l_j(w).
Standard (= batch) GD: w_{i+1} := w_i − η_i ∇_{w_i} f(w_i) = w_i − η_i ∑_j ∇_{w_i} l_j(w_i). O(n) per iteration, needs to see all the data.
Online GD: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i). O(1) per iteration, needs to see a single data point.
42 / 53

114-121 Online (stochastic) gradient descent (illustration)
The slides illustrate the algorithm geometrically: starting from w_1 ∈ W, each round takes a step along the negative gradient, w_2 = w_1 − η ∇_{w_1} l_1(w_1), then w_3 = w_2 − η ∇_{w_2} l_2(w_2), and so on; when a step (here w_4) lands outside the feasible set W, it is projected back onto W. 43 / 53

122 Calculating the gradient
The gradient ∇_{w_i} l_i(w_i) can be obtained by applying the chain rule:
  ∂ l_i(w) / ∂ w_k = ∂ l(y_i, w · x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i} · ∂ (w · x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i} · x_{ik}.
If we denote l'_i(w) := [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i}, then we can write ∇_{w_i} l_i(w_i) = l'_i(w_i) x_i.
44 / 53

123-126 Update rules for specific losses
Linear regression: l(y, ŷ) = (y − ŷ)², l'(w) = −2(y − ŷ). Update: w_{i+1} = w_i + 2 η_i (y_i − ŷ_i) x_i.
Logistic regression: l(y, ŷ) = log(1 + e^{−yŷ}), l'(w) = −y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: l(y, ŷ) = (1 − yŷ)_+, l'(w) = 0 if yŷ > 1, −y if yŷ ≤ 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i: the perceptron!
45 / 53
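
A minimal Python sketch of these per-loss updates (not from the slides; function names and example values are illustrative):

    # The online step is always w <- w - eta * l'(y, y_hat) * x, with the
    # derivative l' depending on the chosen loss.
    import numpy as np

    def lprime_square(y, y_hat):          # (y - y_hat)^2
        return -2 * (y - y_hat)

    def lprime_logistic(y, y_hat):        # log(1 + exp(-y * y_hat)), y in {-1, +1}
        return -y / (1 + np.exp(y * y_hat))

    def lprime_hinge(y, y_hat):           # (1 - y * y_hat)_+, y in {-1, +1}
        return -y if y * y_hat <= 1 else 0.0

    def ogd_step(w, x, y, eta, lprime):
        y_hat = w @ x
        return w - eta * lprime(y, y_hat) * x

    # Example: one hinge-loss step from w = 0 is the perceptron-style update.
    w = np.zeros(3)
    print(ogd_step(w, np.array([1.0, 2.0, -1.0]), y=1, eta=0.5, lprime=lprime_hinge))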

127-130 Projection
  w_i := arg min_{w∈W} ‖w_i − w‖_2.
When W = R^d: no projection step.
When W = {w : ‖w‖ ≤ B} is an L_2-ball, projection corresponds to renormalization of the weight vector: if ‖w_i‖ > B, then w_i := B w_i / ‖w_i‖. Equivalent to L_2 regularization.
When W = {w : ∑_{k=1}^d |w_k| ≤ B} is an L_1-ball, projection corresponds to an additive shift of the absolute values and clipping the smaller weights to 0. Equivalent to L_1 regularization; results in sparse solutions.
46 / 53
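
A minimal Python sketch of both projections (not from the slides; the L_1 case uses the standard sorting-based shift-and-clip method, which is an assumption here, as the slides only describe its effect):

    # Euclidean projections onto an L2-ball and an L1-ball of radius B.
    import numpy as np

    def project_l2(w, B):
        """Project w onto {v : ||v||_2 <= B}: rescale if outside."""
        norm = np.linalg.norm(w)
        return w if norm <= B else B * w / norm

    def project_l1(w, B):
        """Project w onto {v : ||v||_1 <= B}: shift absolute values down by a
        common threshold and clip the smaller ones to zero."""
        if np.abs(w).sum() <= B:
            return w
        u = np.sort(np.abs(w))[::-1]                 # sorted magnitudes, descending
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(w) + 1) > css - B)[0][-1]
        theta = (css[rho] - B) / (rho + 1)           # the additive shift
        return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

    w = np.array([0.8, -0.5, 0.1, 0.05])
    print(project_l2(w, B=0.5), project_l1(w, B=1.0))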

131-132 L_1 vs. L_2 projection
  w_i := arg min_{w∈W} ‖w_i − w‖_2.
The slides compare the two cases geometrically: projecting w_i onto an L_2-ball moves it radially onto the sphere, while projecting onto an L_1-ball tends to land on a face or corner of the ball, zeroing out some coordinates. 47 / 53

133 Convergence of online gradient descent
Theorem: let 0 ∈ W. Assume ‖∇_w l_i(w)‖ ≤ L and let W = max_{w∈W} ‖w‖. Then, with w_1 = 0 and η_i = W / (L √i), the regret is bounded by
  R_n ≤ W L √(2n),
so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(2/n). 48 / 53

134 Exponentiated Gradient [Kivinen & Warmuth 1996]
Gradient descent: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i).
Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{−η_i ∇_{w_i} l_i(w_i)} (componentwise).
Direct extension of Hedge to the classification/regression framework. Requires positive weights, but can be applied in a general setting using the doubling trick. Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant. Works very well when the best model is sparse, but does not produce a sparse solution itself! 49 / 53
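
A minimal Python sketch of one EG step on the positive orthant (not from the slides; the loss, step size, and example data are assumptions):

    # Exponentiated gradient update: multiply each weight componentwise by
    # exp(-eta * gradient) and renormalize by Z_i.
    import numpy as np

    def eg_update(w, grad, eta):
        w_new = w * np.exp(-eta * grad)
        return w_new / w_new.sum()        # normalization by Z_i

    # Example: d = 4 positive weights, gradient of one squared-error loss term.
    w = np.full(4, 0.25)
    x, y = np.array([1.0, -1.0, 0.5, 0.0]), 1.0
    grad = 2 * (w @ x - y) * x            # gradient of (y - w.x)^2 w.r.t. w
    print(eg_update(w, grad, eta=0.5))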

135 Convergence of exponentiated gradient
Theorem: let W be the positive orthant of an L_1-ball with radius W, and assume ‖∇_w l_i(w)‖ ≤ L. Then, with w_1 = (1/d, ..., 1/d) and η_i = (1/(W L)) √(8 log d / i), the regret is bounded by
  R_n ≤ W L √(n log d / 2),
so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(log d / (2n)). 50 / 53

136 Extensions Concept drift: competing with drifting parameter vectors. Partial feedback: contextual multi-armed bandit problems. Improvements for some (strongly convex, exp-concave) loss functions. Infinite-dimensional feature spaces via kernel trick. Learning matrix parameters (matrix norm regularization, positive definiteness, permutation matrices) / 53

137 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 52 / 53

138 Conclusions
A theoretical framework for learning without the i.i.d. assumption. Performance bounds are often simpler to prove than in the stochastic setting. Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc. Results in online algorithms directly applicable to large-scale learning problems. Most currently used offline learning algorithms employ online learning as an optimization routine. 53 / 53
