Introduction to Online Learning Theory


1 Introduction to Online Learning Theory. Wojciech Kotłowski, Institute of Computing Science, Poznań University of Technology. IDSS, 1 / 53

2 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 2 / 53


4 Text document classification [Bottou, 2007] Data set: Reuters RCV1, a set of news documents published by Reuters News, split into training examples and testing examples, with TF-IDF features. Each document was assigned some topics (categories). For the sake of illustration, the problem is simplified to binary classification: predict whether the document belongs to category CCAT (Corporate/Industrial). 4 / 53

5 Document example
<?xml version="1.0" encoding="iso "?>
<newsitem itemid="2330" id="root" date=" " xml:lang="en">
<title>USA: Tylan stock jumps; weighs sale of company.</title>
<headline>Tylan stock jumps; weighs sale of company.</headline>
<dateline>SAN DIEGO</dateline>
<text>
<p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p>
<p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
<p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs & Co. has been retained as its financial adviser.</p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0"> <code code="USA"> </code> </codes>
<codes class="bip:industries:1.0"> <code code="I34420"> </code> </codes>
<codes class="bip:topics:1.0"> <code code="C15"> </code> <code code="C152"> </code> <code code="C18"> </code> <code code="C181"> </code> <code code="CCAT"> </code> </codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value=" "/>
<dc element="dc.source" value="Reuters"/>
5 / 53

6 Experiment
Two types of loss functions: logistic loss (logistic regression) and hinge loss (SVM).
Two types of learning: standard (batch setting), i.e. minimization of the (regularized) empirical risk on the training data, and online gradient descent (stochastic gradient descent).
L2 regularization.
Source: 6 / 53

7 Results
Hinge loss:
  method             comp. time    training error    test error
  SVMLight (batch)        sec             %                %
  SVMPerf (batch)      66 sec             %                %
  SGD                 1.4 sec             %                %
Logistic loss:
  method             comp. time    training error    test error
  LibLinear (batch)    30 sec             %                %
  SGD                 2.3 sec             %                %
7 / 53

8 Some more examples (hinge loss)
  Data set       #objects    #features    time LIBSVM    time SGD
  Reuters                                   days           7 sec
  Translation                               many days      7 sec
  SuperTag                                  h              1 sec
  Voicetone                                 h              1 sec
8 / 53

9 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 9 / 53

10 A typical scheme in machine learning
A training set S = {(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)} is fed to a learning algorithm, which outputs a function (classifier) w_S : X → Y. The classifier is applied to a test set T of new instances (x_5, ?), (x_6, ?), (x_7, ?), giving predictions ŷ_5 = w_S(x_5), ŷ_6 = w_S(x_6), ŷ_7 = w_S(x_7); the feedback y_5, y_6, y_7 is then used to measure accuracy via the losses l(y_5, ŷ_5), l(y_6, ŷ_6), l(y_7, ŷ_7). 10 / 53

11-12 Test set performance
Ultimate question of machine learning theory: given a training set S = {(x_i, y_i)}_{i=1}^n, how to learn a function w_S : X → Y so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,
  ∑_{i=n+1}^m l(y_i, w_S(x_i)),
is minimized?
No reasonable answer without any assumptions ("no free lunch"). Training data and test data must be in some sense similar. 11 / 53

13-14 Statistical learning theory
Assumption: training data and test data were independently generated from the same distribution P (i.i.d.).
Mean training error of w (empirical risk): L_S(w) = (1/n) ∑_{i=1}^n l(y_i, w(x_i)).
Test error = expected error of w (risk): L(w) = E_{(x,y)~P}[l(y, w(x))].
Given training data S, how to construct w_S to make L(w_S) as small as possible? ⇒ Empirical risk minimization. 12 / 53
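
To make empirical risk minimization over a finite class concrete, here is a minimal Python sketch (not from the slides; the finite class of threshold classifiers, the toy data, and all names are illustrative assumptions):

    # A minimal ERM sketch: pick, from a finite class W, the classifier with
    # the smallest empirical risk L_S(w) under the 0/1 loss.
    import numpy as np

    def zero_one_loss(y, y_hat):
        return float(y != y_hat)

    def empirical_risk(classifier, S):
        return np.mean([zero_one_loss(y, classifier(x)) for x, y in S])

    # Finite class W: threshold classifiers on a 1-d feature (assumed example).
    thresholds = np.linspace(-1.0, 1.0, 21)
    W = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

    # Toy i.i.d. training sample S = {(x_i, y_i)} (assumed data).
    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=100)
    ys = (xs + 0.1 * rng.normal(size=100) >= 0.2).astype(int)
    S = list(zip(xs, ys))

    # ERM: w_S = arg min over W of the empirical risk.
    w_S = min(W, key=lambda w: empirical_risk(w, S))
    print("empirical risk of w_S:", empirical_risk(w_S, S))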

15-16 Generalization bounds
Theorem for finite classes (0/1 loss) [Occam's Razor]: let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w∈W} L_S(w), then:
  E_{S~P} L(w_S) = min_{w∈W} L(w) + O(√(log|W| / n)).
Here E_{S~P} L(w_S) is our test set performance, min_{w∈W} L(w) is the best test set performance in W, and the O(√(log|W| / n)) term is the overhead for learning on a finite training set. 13 / 53

17-18 Generalization bounds
Theorem for VC classes (0/1 loss) [Vapnik-Chervonenkis, 1971]: let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w∈W} L_S(w), then:
  E_{S~P} L(w_S) = min_{w∈W} L(w) + O(√(d_VC / n)).
As before, the left-hand side is our test set performance, the min term is the best test set performance in W, and the O term is the overhead for learning on a finite training set. 14 / 53

19 Some remarks on statistical learning theory
The bounds are asymptotically tight. Typically d_VC ≈ d, the number of parameters, so that:
  L(w_S) = min_{w∈W} L(w) + O(√(d / n)).
Similar results (sometimes better) for losses other than 0/1. Holds for any distribution P. Improvements: data-dependent bounds. The theory only tells you what happens for a given class W; choosing W is an art (domain knowledge). 15 / 53

20 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 16 / 53

21-23 Alternative view on learning
The stochastic (i.i.d.) assumption is often criticized, and sometimes clearly invalid (e.g. time series). The learning process is by its very nature incremental. We do not observe the distributions, we only see the data.
Motivation: can we relax the i.i.d. assumption and treat the data generating process as completely arbitrary? Can we obtain performance guarantees based solely on observed quantities?
We can! ⇒ online learning theory. 17 / 53

24-51 Example: weather prediction (rain/sunny)
The slides build the example incrementally over days i = 1, 2, 3, 4; the actual weather on each day is shown as rain/sun icons in the original slides.
First, a single forecaster announces a probability of rain each day: 75%, 25%, 10%, 25%.
Then the forecaster is given the predictions of four experts and combines them into its own forecast:

  expert     i = 1   i = 2   i = 3   i = 4
  1           30%     50%     50%     10%
  2           10%     80%     50%     10%
  3           20%     70%     50%     30%
  4           60%     30%     50%     80%
  learner     30%     65%     50%     10%

Prediction accuracy is evaluated by a loss function, e.g. l(y_i, ŷ_i) = |y_i − ŷ_i|.
Total performance is evaluated by a regret: the learner's cumulative loss minus the cumulative loss of the best expert in hindsight. In the example, the cumulative loss of each expert and of the learner is computed (the learner's comes to 1.35), and the learner's regret is the difference between its cumulative loss and that of the best expert.
The goal is to have small regret for any data sequence. 18-20 / 53

52 Online learning framework (diagram)
At round i the learner, with current strategy w_i : X → Y, receives a new instance (x_i, ?), makes a prediction ŷ_i = w_i(x_i), gets the feedback y_i, suffers the loss l(y_i, ŷ_i), and moves on to round i + 1. 21 / 53

53 Online learning framework
Set of strategies (actions) W; known loss function l. Learner starts with some initial strategy (action) w_1. For i = 1, 2, ...:
1. Learner observes instance x_i.
2. Learner predicts with ŷ_i = w_i(x_i).
3. The environment reveals outcome y_i.
4. Learner suffers loss l_i(w_i) = l(y_i, ŷ_i).
5. Learner updates its strategy w_i → w_{i+1}.
22 / 53

54 Online learning framework
The goal of the learner is to be close to the best w in hindsight.
Cumulative loss of the learner: ˆL_n = ∑_{i=1}^n l_i(w_i).
Cumulative loss of the best strategy w in hindsight: L*_n = min_{w∈W} ∑_{i=1}^n l_i(w).
Regret of the learner: R_n = ˆL_n − L*_n.
The goal is to minimize regret over all possible data sequences. 23 / 53
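
As a concrete rendering of the protocol and the regret, here is a minimal Python sketch in the expert setting (not from the slides; the harness, its names, and the random losses are illustrative assumptions):

    # Run an arbitrary weighting strategy against a sequence of expert losses
    # and report cumulative loss, best expert in hindsight, and regret.
    import numpy as np

    def run_online(expert_losses, choose_weights):
        """expert_losses: (n, N) array, loss of each of N experts per round.
        choose_weights(past_cumulative_losses) -> probability vector over experts."""
        n, N = expert_losses.shape
        cum = np.zeros(N)                        # cumulative loss of each expert
        learner_loss = 0.0
        for i in range(n):
            w = choose_weights(cum)              # learner's beliefs before round i
            learner_loss += w @ expert_losses[i] # expected loss of following w
            cum += expert_losses[i]              # feedback: experts' losses revealed
        best_expert = cum.min()
        return learner_loss, best_expert, learner_loss - best_expert  # regret R_n

    # Example: uniform averaging of N = 3 experts on random 0/1 losses (toy data).
    rng = np.random.default_rng(1)
    losses = rng.integers(0, 2, size=(100, 3)).astype(float)
    print(run_online(losses, lambda cum: np.ones(len(cum)) / len(cum)))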

55 Prediction with expert advice
N prediction strategies ("experts") to follow. Instance x = (x_1, ..., x_N): vector of experts' predictions. Strategy w = (w_1, ..., w_N): probability vector (∑_{k=1}^N w_k = 1, w_k ≥ 0), the learner's beliefs about the experts, or the probability of following a given expert.
Learner's prediction: ŷ = w(x) = w · x = ∑_{k=1}^N w_k x_k.
Regret: competing with the best expert, R_n = ˆL_n − L*_n, where L*_n = min_{k=1,...,N} L_n(k) and L_n(k) is the loss of the kth expert. 24 / 53

56 Applications of the experts advice setting
Combining advice from actual experts :-) Combining N learning algorithms/classifiers. Gambling (e.g., horse racing). Portfolio selection. Routing/shortest path. Spam filtering/document classification. Learning boolean formulae. Ranking/ordering... 25 / 53

57 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 26 / 53

58 Follow the leader
Follow the leader (FTL): at iteration i, choose the expert which has the smallest loss on the past data:
  k_min = arg min_{k=1,...,N} ∑_{j=1}^{i-1} l_j(k) = arg min_{k=1,...,N} L_{i-1}(k).
Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min. 27 / 53
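
A minimal Python sketch of this rule (not from the slides; names are illustrative):

    # FTL in the expert setting: put all weight on the expert with the
    # smallest cumulative loss so far.
    import numpy as np

    def ftl_weights(cum_losses):
        """cum_losses: L_{i-1}(k) for each expert k; returns the weight vector."""
        w = np.zeros(len(cum_losses))
        w[np.argmin(cum_losses)] = 1.0   # w_{k_min} = 1, all other weights 0
        return w

    # Example: after cumulative losses (2.0, 1.5, 3.0), FTL follows expert 2.
    print(ftl_weights(np.array([2.0, 1.5, 3.0])))   # [0. 1. 0.]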

59-82 Failure of FTL
The slides step through a small example in which the per-round losses of the experts are chosen so that the leader changes from round to round. FTL keeps switching to yesterday's best expert and suffers loss on almost every round, while the best fixed expert suffers loss on only about half of the rounds. Hence
  ˆL_n ≈ n,   L*_n ≈ n/2,   R_n ≈ n/2,
i.e. the regret grows linearly with n.
Learner must hedge its bets on experts! 28 / 53

83-85 Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]
Algorithm: each time expert k receives a loss l(k), multiply the weight w_k associated with that expert by e^{-η l(k)}, where η > 0:
  w_{i+1,k} = w_{i,k} e^{-η l_i(k)} / Z_i,   where Z_i = ∑_{k=1}^N w_{i,k} e^{-η l_i(k)}.
Unwinding this update:
  w_{i+1,k} = e^{-η L_i(k)} / Z_i,   where L_i(k) = ∑_{j≤i} l_j(k) and Z_i = ∑_{k=1}^N e^{-η L_i(k)}.
29 / 53
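
A minimal Python sketch of one Hedge step (not from the slides; names and the example values are illustrative):

    # Hedge update: multiply each expert's weight by exp(-eta * loss) and
    # renormalize by Z_i.
    import numpy as np

    def hedge_update(w, losses, eta):
        """w: current weight vector; losses: l_i(k) for each expert k."""
        w_new = w * np.exp(-eta * losses)
        return w_new / w_new.sum()        # divide by Z_i

    # Example: 4 experts, uniform prior, one round of losses, eta = 2.
    w = np.full(4, 0.25)
    print(hedge_update(w, np.array([0.3, 0.1, 0.2, 0.6]), eta=2.0))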

86-87 Hedge as Bayesian updates
Prior probability over N alternatives E_1, ..., E_N; data likelihoods P(D_i | E_k), k = 1, ..., N:
  P(E_k | D_i) = P(D_i | E_k) P(E_k) / ∑_{j=1}^N P(D_i | E_j) P(E_j).
The Hedge update has the same form: the posterior probability corresponds to w_{i+1,k}, the data likelihood to e^{-η l_i(k)}, the prior probability to w_{i,k}, and the normalization to Z_i. 30 / 53

88-99 Hedge example (η = 2)
The slides replay the weather example with Hedge: after each day the weight of every expert is multiplied by e^{-η l(k)} and renormalized, and the learner predicts with the weighted average of the experts' forecasts. On the first day, with uniform weights, the experts' predictions 30%, 10%, 20%, 60% combine to 30%; later, the predictions 50%, 80%, 70%, 30% combine to 64%, the predictions 50%, 50%, 50%, 50% to 50%, and the predictions 10%, 10%, 30%, 80% to 21%. Experts with small cumulative loss end up dominating the weight distribution (shown as bar charts in the original slides). 31 / 53

100 Hedge analysis
Regret bound: for any data sequence, when η = √(8 log N / n),
  R_n ≤ √(n log N / 2).
Regret bound: for any data sequence, when η = √(2 log N / L*_n),
  R_n ≤ √(2 L*_n log N) + log N.
Both bounds are tight. 32 / 53
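
A quick numeric illustration of the first bound (the values of N and n are assumed for illustration, not from the slides):

    # Tune eta as in the first bound and evaluate the resulting regret bound.
    import math

    N, n = 4, 1000                                  # illustrative numbers
    eta = math.sqrt(8 * math.log(N) / n)            # eta = sqrt(8 log N / n)
    bound = math.sqrt(n * math.log(N) / 2)          # R_n <= sqrt(n log N / 2)
    print(f"eta = {eta:.3f}, regret bound = {bound:.1f}, per round = {bound / n:.4f}")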

101-102 Statistical learning theory vs. online learning theory
Statistical learning theory. Theorem: if W is finite, then for a classifier w_S trained by empirical risk minimization,
  E_{S~P} L(w_S) − min_{w∈W} L(w) = O(√(log|W| / n)).
Online learning theory. Theorem: in the expert setting (|W| = N), for the Hedge algorithm,
  (1/n) R_n = (1/n) ˆL_n − (1/n) L*_n = O(√(log|W| / n)).
Essentially the same performance without the i.i.d. assumption, but at the price of averaging! 33 / 53

103 Prediction with expert advice: Extensions Large (or countably infinite) class of experts. Concept drift: competing with the best sequence of experts. Competing with the best small set of recurring experts. Ranking: competing with the best permutation. Partial feedback: multi-armed bandits / 53

104-105 Concept drift
Competing with the best sequence of m experts. Idea: treat each sequence of m experts as a new expert and run Hedge on top:
  w_{i+1,k} = α (1/N) + (1 − α) · w_{i,k} e^{-η l_i(k)} / ∑_{j=1}^N w_{i,j} e^{-η l_i(j)},
a mixture of the standard Hedge weights and the initial distribution. Parameter α = m/n (frequency of changes). Bayesian interpretation: prior and posterior over sequences of experts.
Competing with the best small set of m recurring experts: mixture over all past Hedge weights. 35 / 53
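
A minimal Python sketch of this mixture update (not from the slides; names and example values are illustrative assumptions):

    # Fixed-share style update for concept drift: mix the Hedge posterior with
    # the uniform initial distribution, using alpha as the mixing parameter.
    import numpy as np

    def fixed_share_update(w, losses, eta, alpha):
        hedge = w * np.exp(-eta * losses)
        hedge /= hedge.sum()                      # standard Hedge weights
        N = len(w)
        return alpha / N + (1 - alpha) * hedge    # mix with the uniform prior

    # Example: alpha plays the role of m/n, the assumed frequency of changes.
    w = np.full(4, 0.25)
    print(fixed_share_update(w, np.array([0.3, 0.1, 0.2, 0.6]), eta=2.0, alpha=0.05))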

106 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 36 / 53

107 Classification/regression framework
Instances x: feature vectors. Outcome y: class/real output. Strategy w ∈ W: parameter vector w = (w_1, ..., w_d) ∈ R^d. W can be R^d or a regularization ball W = {w : ‖w‖_p ≤ B}. Prediction is linear: ŷ = w · x. The loss l depends on the task we want to solve, but we assume l(y, ŷ) is convex in ŷ. 37 / 53

108 Examples
Linear regression: y ∈ R and l(y, ŷ) = (y − ŷ)².
Logistic regression: y ∈ {−1, +1} and l(y, ŷ) = log(1 + e^{−yŷ}).
Support vector machines: y ∈ {−1, +1} and l(y, ŷ) = (1 − yŷ)_+.
38 / 53

109 Examples (plot)
Plot of the three losses as functions of the prediction ŷ: hinge (SVM), square, and logistic. Logistic and hinge losses plotted for y = 1; the squared error loss is also shown. 39 / 53

110-111 Gradient descent method
Minimize a function f(w) over w ∈ R^d. Gradient descent method:
  w_{i+1} = w_i − η_i ∇_{w_i} f(w_i),
where η_i is a step size. If we have a set of constraints w ∈ W, after each step we need to project back onto W:
  w_{i+1} := arg min_{w∈W} ‖w_{i+1} − w‖_2.
40 / 53

112 Online (stochastic) gradient descent
Algorithm: start with any initial vector w_1 ∈ W. For i = 1, 2, ...:
1. Observe input vector x_i.
2. Predict with ŷ_i = w_i · x_i.
3. Outcome y_i is revealed.
4. Suffer loss l_i(w_i) = l(y_i, ŷ_i).
5. Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i).
6. If w_{i+1} ∉ W, project it back onto W: w_{i+1} := arg min_{w∈W} ‖w_{i+1} − w‖_2.
41 / 53
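
A minimal Python sketch of this loop for squared error loss with projection onto an L2-ball (not from the slides; the step-size schedule, radius B, and the synthetic data are assumptions):

    # Online gradient descent for a linear model with squared error loss,
    # projected onto W = {w : ||w||_2 <= B}.
    import numpy as np

    def ogd(stream, d, eta0=0.1, B=10.0):
        """stream: iterable of (x_i, y_i) pairs with x_i in R^d."""
        w = np.zeros(d)
        for i, (x, y) in enumerate(stream, start=1):
            y_hat = w @ x                        # predict
            grad = 2 * (y_hat - y) * x           # gradient of (y - y_hat)^2 w.r.t. w
            w = w - (eta0 / np.sqrt(i)) * grad   # gradient step, decaying step size
            norm = np.linalg.norm(w)
            if norm > B:                         # project back onto the L2-ball
                w = B * w / norm
        return w

    # Example on synthetic data generated from a fixed linear model (toy data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
    print(ogd(zip(X, y), d=3))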

113 Online vs. standard gradient descent
The function we want to minimize: f(w) = ∑_{j=1}^n l_j(w).
Standard (= batch) GD: w_{i+1} := w_i − η_i ∇_{w_i} f(w_i) = w_i − η_i ∑_j ∇_{w_i} l_j(w_i). O(n) per iteration, needs to see all the data.
Online GD: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i). O(1) per iteration, needs to see a single data point.
42 / 53

114-121 Online (stochastic) gradient descent (illustration)
The slides illustrate the algorithm geometrically: starting from w_1 ∈ W, each round takes a step along the negative gradient, w_2 = w_1 − η ∇_{w_1} l_1(w_1), then w_3 = w_2 − η ∇_{w_2} l_2(w_2), and so on; when a step (here w_4) lands outside the feasible set W, it is projected back onto W. 43 / 53

122 Calculating the gradient
The gradient ∇_{w_i} l_i(w_i) can be obtained by applying the chain rule:
  ∂ l_i(w) / ∂ w_k = ∂ l(y_i, w · x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i} · ∂ (w · x_i) / ∂ w_k = [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i} · x_{ik}.
If we denote l'_i(w) := [∂ l(y_i, ŷ) / ∂ ŷ]_{ŷ = w · x_i}, then we can write ∇_{w_i} l_i(w_i) = l'_i(w_i) x_i.
44 / 53

123-126 Update rules for specific losses
Linear regression: l(y, ŷ) = (y − ŷ)², l'(w) = −2(y − ŷ). Update: w_{i+1} = w_i + 2 η_i (y_i − ŷ_i) x_i.
Logistic regression: l(y, ŷ) = log(1 + e^{−yŷ}), l'(w) = −y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: l(y, ŷ) = (1 − yŷ)_+, l'(w) = 0 if yŷ > 1, −y if yŷ ≤ 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i: the perceptron!
45 / 53
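
A minimal Python sketch of these per-loss updates (not from the slides; function names and example values are illustrative):

    # The online step is always w <- w - eta * l'(y, y_hat) * x, with the
    # derivative l' depending on the chosen loss.
    import numpy as np

    def lprime_square(y, y_hat):          # (y - y_hat)^2
        return -2 * (y - y_hat)

    def lprime_logistic(y, y_hat):        # log(1 + exp(-y * y_hat)), y in {-1, +1}
        return -y / (1 + np.exp(y * y_hat))

    def lprime_hinge(y, y_hat):           # (1 - y * y_hat)_+, y in {-1, +1}
        return -y if y * y_hat <= 1 else 0.0

    def ogd_step(w, x, y, eta, lprime):
        y_hat = w @ x
        return w - eta * lprime(y, y_hat) * x

    # Example: one hinge-loss step from w = 0 is the perceptron-style update.
    w = np.zeros(3)
    print(ogd_step(w, np.array([1.0, 2.0, -1.0]), y=1, eta=0.5, lprime=lprime_hinge))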

127-130 Projection
  w_i := arg min_{w∈W} ‖w_i − w‖_2.
When W = R^d: no projection step.
When W = {w : ‖w‖ ≤ B} is an L_2-ball, projection corresponds to renormalization of the weight vector: if ‖w_i‖ > B, then w_i := B w_i / ‖w_i‖. Equivalent to L_2 regularization.
When W = {w : ∑_{k=1}^d |w_k| ≤ B} is an L_1-ball, projection corresponds to an additive shift of the absolute values and clipping the smaller weights to 0. Equivalent to L_1 regularization; results in sparse solutions.
46 / 53
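
A minimal Python sketch of both projections (not from the slides; the L_1 case uses the standard sorting-based shift-and-clip method, which is an assumption here, as the slides only describe its effect):

    # Euclidean projections onto an L2-ball and an L1-ball of radius B.
    import numpy as np

    def project_l2(w, B):
        """Project w onto {v : ||v||_2 <= B}: rescale if outside."""
        norm = np.linalg.norm(w)
        return w if norm <= B else B * w / norm

    def project_l1(w, B):
        """Project w onto {v : ||v||_1 <= B}: shift absolute values down by a
        common threshold and clip the smaller ones to zero."""
        if np.abs(w).sum() <= B:
            return w
        u = np.sort(np.abs(w))[::-1]                 # sorted magnitudes, descending
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(w) + 1) > css - B)[0][-1]
        theta = (css[rho] - B) / (rho + 1)           # the additive shift
        return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

    w = np.array([0.8, -0.5, 0.1, 0.05])
    print(project_l2(w, B=0.5), project_l1(w, B=1.0))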

131-132 L_1 vs. L_2 projection
  w_i := arg min_{w∈W} ‖w_i − w‖_2.
The slides compare the two cases geometrically: projecting w_i onto an L_2-ball moves it radially onto the sphere, while projecting onto an L_1-ball tends to land on a face or corner of the ball, zeroing out some coordinates. 47 / 53

133 Convergence of online gradient descent
Theorem: let 0 ∈ W. Assume ‖∇_w l_i(w)‖ ≤ L and let W = max_{w∈W} ‖w‖. Then, with w_1 = 0 and η_i = W / (L √i), the regret is bounded by
  R_n ≤ W L √(2n),
so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(2/n). 48 / 53

134 Exponentiated Gradient [Kivinen & Warmuth 1996]
Gradient descent: w_{i+1} := w_i − η_i ∇_{w_i} l_i(w_i).
Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{−η_i ∇_{w_i} l_i(w_i)} (componentwise).
Direct extension of Hedge to the classification/regression framework. Requires positive weights, but can be applied in a general setting using the doubling trick. Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant. Works very well when the best model is sparse, but does not produce a sparse solution itself! 49 / 53
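
A minimal Python sketch of one EG step on the positive orthant (not from the slides; the loss, step size, and example data are assumptions):

    # Exponentiated gradient update: multiply each weight componentwise by
    # exp(-eta * gradient) and renormalize by Z_i.
    import numpy as np

    def eg_update(w, grad, eta):
        w_new = w * np.exp(-eta * grad)
        return w_new / w_new.sum()        # normalization by Z_i

    # Example: d = 4 positive weights, gradient of one squared-error loss term.
    w = np.full(4, 0.25)
    x, y = np.array([1.0, -1.0, 0.5, 0.0]), 1.0
    grad = 2 * (w @ x - y) * x            # gradient of (y - w.x)^2 w.r.t. w
    print(eg_update(w, grad, eta=0.5))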

135 Convergence of exponentiated gradient
Theorem: let W be the positive orthant of an L_1-ball with radius W, and assume ‖∇_w l_i(w)‖ ≤ L. Then, with w_1 = (1/d, ..., 1/d) and η_i = (1/(W L)) √(8 log d / i), the regret is bounded by
  R_n ≤ W L √(n log d / 2),
so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(log d / (2n)). 50 / 53

136 Extensions Concept drift: competing with drifting parameter vectors. Partial feedback: contextual multi-armed bandit problems. Improvements for some (strongly convex, exp-concave) loss functions. Infinite-dimensional feature spaces via kernel trick. Learning matrix parameters (matrix norm regularization, positive definiteness, permutation matrices) / 53

137 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with experts advice 5 Algorithms for classification and regression. 6 Conclusions 52 / 53

138 Conclusions
A theoretical framework for learning without the i.i.d. assumption. Performance bounds are often simpler to prove than in the stochastic setting. Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc. Results in online algorithms directly applicable to large-scale learning problems. Most currently used offline learning algorithms employ online learning as an optimization routine. 53 / 53
