Introduction to Online Learning Theory


1 Introduction to Online Learning Theory Wojciech Kotłowski, Institute of Computing Science, Poznań University of Technology, IDSS. 1 / 53
2 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 2 / 53
4 Text document classification [Bottou, 2007] Data set: Reuters RCV1, a set of documents from Reuters News, split into training and testing examples, with TF-IDF features. Each document was assigned a set of topics (categories). For the sake of illustration, the problem is simplified to binary classification: predicting whether a document belongs to the category CCAT (Corporate/Industrial). 4 / 53
5 Document example <?xml version="1.0" encoding="iso "?> <newsitem itemid="2330" id="root" date=" " xml:lang="en"> <title>USA: Tylan stock jumps; weighs sale of company.</title> <headline>Tylan stock jumps; weighs sale of company.</headline> <dateline>SAN DIEGO</dateline> <text> <p>The stock of Tylan General Inc. jumped Tuesday after the maker of process-management equipment said it is exploring the sale of the company and added that it has already received some inquiries from potential buyers.</p> <p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p> <p>The company said it has set up a committee of directors to oversee the sale and that Goldman, Sachs & Co. has been retained as its financial adviser.</p> </text> <copyright>(c) Reuters Limited 1996</copyright> <metadata> <codes class="bip:countries:1.0"> <code code="usa"> </code> </codes> <codes class="bip:industries:1.0"> <code code="i34420"> </code> </codes> <codes class="bip:topics:1.0"> <code code="c15"> </code> <code code="c152"> </code> <code code="c18"> </code> <code code="c181"> </code> <code code="ccat"> </code> </codes> <dc element="dc.publisher" value="Reuters Holdings Plc"/> <dc element="dc.date.published" value=" "/> <dc element="dc.source" value="Reuters"/> 5 / 53
6 Experiment Two types of loss functions: logistic loss (logistic regression) and hinge loss (SVM). Two types of learning: standard (batch setting), i.e. minimization of the (regularized) empirical risk on the training data; and online gradient descent (stochastic gradient descent). L2 regularization in both cases. Source: 6 / 53
7 Results
Hinge loss:
  method             comp. time   training error   test error
  SVMLight (batch)      ... sec            ... %        ... %
  SVMPerf (batch)        66 sec            ... %        ... %
  SGD                   1.4 sec            ... %        ... %
Logistic loss:
  method             comp. time   training error   test error
  LibLinear (batch)      30 sec            ... %        ... %
  SGD                   2.3 sec            ... %        ... %
7 / 53
8 Some more examples (hinge loss)
  Data set      #objects   #features   time LIBSVM   time SGD
  Reuters            ...         ...      ... days      7 sec
  Translation        ...         ...     many days      7 sec
  SuperTag           ...         ...         ... h      1 sec
  Voicetone          ...         ...         ... h      1 sec
8 / 53
9 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 9 / 53
10 A typical scheme in machine learning [Figure: a learning algorithm receives a training set S = {(x_1, y_1), ..., (x_4, y_4)} and outputs a function (classifier) w_S : X -> Y. The classifier makes predictions ŷ_i = w_S(x_i) on a test set T = {(x_5, ?), (x_6, ?), (x_7, ?)}; once the true labels y_5, y_6, y_7 are revealed as feedback, accuracy is measured by the losses l(y_5, ŷ_5), l(y_6, ŷ_6), l(y_7, ŷ_7).] 10 / 53
11 Test set performance. Ultimate question of machine learning theory: given a training set S = {(x_i, y_i)}_{i=1}^n, how do we learn a function w_S : X -> Y so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m, namely Σ_{i=n+1}^m l(y_i, w_S(x_i)), is minimized? There is no reasonable answer without any assumptions ("no free lunch"): training data and test data must be in some sense similar. 11 / 53
13 Statistical learning theory. Assumption: training data and test data were independently generated from the same distribution P (i.i.d.). Mean training error of w (empirical risk): L_S(w) = (1/n) Σ_{i=1}^n l(y_i, w(x_i)). Test error = expected error of w (risk): L(w) = E_{(x,y)~P}[l(y, w(x))]. Given training data S, how do we construct w_S to make L(w_S) as small as possible? => Empirical risk minimization. 12 / 53
15 Generalization bounds. Theorem for finite classes (0/1 loss) [Occam's Razor]: let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w in W} L_S(w), then E_{S~P} L(w_S) = min_{w in W} L(w) + O(sqrt(log|W| / n)). Here the left-hand side is our test set performance, min_{w in W} L(w) is the best test set performance in W, and the O(.) term is the overhead for learning on a finite training set. 13 / 53
17 Generalization bounds. Theorem for VC classes (0/1 loss) [Vapnik-Chervonenkis, 1971]: let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = arg min_{w in W} L_S(w), then E_{S~P} L(w_S) = min_{w in W} L(w) + O(sqrt(d_VC / n)). As before: our test set performance equals the best test set performance in W plus an overhead for learning on a finite training set. 14 / 53
19 Some remarks on statistical learning theory. The bounds are asymptotically tight. Typically d_VC ≈ d, the number of parameters, so that L(w_S) = min_{w in W} L(w) + O(sqrt(d / n)). Similar results (sometimes better) hold for losses other than 0/1. The bounds hold for any distribution P. Improvements: data-dependent bounds. The theory only tells you what happens for a given class W; choosing W is an art (domain knowledge). 15 / 53
20 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 16 / 53
21 Alternative view on learning. The stochastic (i.i.d.) assumption is often criticized, and sometimes clearly invalid (e.g. time series). The learning process is by its very nature incremental. We do not observe the distributions, we only see the data. Motivation: can we relax the i.i.d. assumption and treat the data-generating process as completely arbitrary? Can we obtain performance guarantees based solely on observed quantities? We can! => online learning theory. 17 / 53
24 Example: weather prediction (rain/sunny) [Figure, built up over several slides: on each day i = 1, 2, 3, 4 the learner announces a probability of rain (30%, 25%, 10%, 25% on the four days), and nature then reveals the actual weather.] 18 / 53
34 Example: weather prediction (rain/sunny) with expert advice [Figure, built up over several slides: on each day four experts announce rain probabilities, the learner aggregates them into its own prediction, and the weather is then revealed.]

expert    i = 1   i = 2   i = 3   i = 4
1           30%     50%     50%     10%
2           10%     80%     50%     10%
3           20%     70%     50%     30%
4           60%     30%     50%     80%
learner     30%     65%     50%     10%   ...

19 / 53
48 Example: weather prediction (rain/sunny). Prediction accuracy is evaluated by a loss function, e.g. l(y_i, ŷ_i) = |y_i - ŷ_i|. Total performance is evaluated by the regret: the learner's cumulative loss minus the cumulative loss of the best expert in hindsight.

expert    predictions              cumulative loss
1           30%  50%  50%  10%     = ...
2           10%  80%  50%  10%     = ...
3           20%  70%  50%  30%     = ...
4           60%  30%  50%  80%     = ...
learner     30%  65%  50%  10%     = 1.35

Regret of the learner: ... The goal is to have small regret for any data sequence. 20 / 53
52 Online learning framework [Figure: in each round i, the learner (strategy w_i : X -> Y) receives a new instance (x_i, ?), makes a prediction ŷ_i = w_i(x_i), receives the feedback y_i, suffers the loss l(y_i, ŷ_i), and moves on to round i + 1.] 21 / 53
53 Online learning framework. Set of strategies (actions) W; known loss function l. The learner starts with some initial strategy (action) w_1. For i = 1, 2, ...: (1) the learner observes an instance x_i; (2) the learner predicts with ŷ_i = w_i(x_i); (3) the environment reveals the outcome y_i; (4) the learner suffers the loss l_i(w_i) = l(y_i, ŷ_i); (5) the learner updates its strategy w_i -> w_{i+1}. 22 / 53
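The five-step protocol can be sketched as a generic loop. This is an illustrative sketch, not code from the lecture: the `Learner` interface and the toy `RunningMeanLearner` strategy below are invented here just to make the loop runnable.

```python
def online_learning(learner, stream, loss):
    """Run the online protocol: observe x, predict, see y, suffer loss, update."""
    total_loss = 0.0
    for x, y in stream:
        y_hat = learner.predict(x)    # step 2: predict with the current strategy
        total_loss += loss(y, y_hat)  # step 4: suffer the loss
        learner.update(x, y)          # step 5: update w_i -> w_{i+1}
    return total_loss

class RunningMeanLearner:
    """Toy strategy that ignores x and predicts the mean of past outcomes."""
    def __init__(self):
        self.n, self.s = 0, 0.0
    def predict(self, x):
        return self.s / self.n if self.n else 0.5
    def update(self, x, y):
        self.n += 1
        self.s += y

# binary outcomes, absolute loss |y - yhat|
stream = [(None, 1), (None, 0), (None, 1), (None, 1)]
total = online_learning(RunningMeanLearner(), stream, lambda y, p: abs(y - p))
```

Note that the learner commits to its prediction before the outcome is revealed; the loss is charged against that committed prediction.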
54 Online learning framework. The goal of the learner is to be close to the best w in hindsight. Cumulative loss of the learner: L̂_n = Σ_{i=1}^n l_i(w_i). Cumulative loss of the best strategy w in hindsight: L_n = min_{w in W} Σ_{i=1}^n l_i(w). Regret of the learner: R_n = L̂_n - L_n. The goal is to minimize the regret over all possible data sequences. 23 / 53
55 Prediction with expert advice. N prediction strategies ("experts") to follow. Instance x = (x_1, ..., x_N): the vector of the experts' predictions. Strategy w = (w_1, ..., w_N): a probability vector (Σ_{k=1}^N w_k = 1, w_k ≥ 0), representing the learner's beliefs about the experts, or the probability of following a given expert. Learner's prediction: ŷ = w(x) = w·x = Σ_{k=1}^N w_k x_k. Regret (competing with the best expert): R_n = L̂_n - L_n, where L_n = min_{k=1,...,N} L_n(k) and L_n(k) is the cumulative loss of the k-th expert. 24 / 53
56 Applications of the expert advice setting. Combining advice from actual experts :) Combining N learning algorithms/classifiers. Gambling (e.g., horse racing). Portfolio selection. Routing/shortest path. Spam filtering/document classification. Learning boolean formulae. Ranking/ordering... 25 / 53
57 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 26 / 53
58 Follow the leader. Follow the leader (FTL): at iteration i, choose the expert with the smallest loss on the past data: k_min = arg min_{k=1,...,N} Σ_{j=1}^{i-1} l_j(k) = arg min_{k=1,...,N} L_{i-1}(k). Corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min. 27 / 53
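The FTL selection rule is a one-liner (a sketch; the function names are mine, and ties are broken toward the lowest index):

```python
def ftl_choice(cum_losses):
    """Pick the expert with the smallest cumulative past loss (ties: lowest index)."""
    return min(range(len(cum_losses)), key=lambda k: cum_losses[k])

def ftl_weights(cum_losses):
    """The corresponding strategy w: all probability mass on the current leader."""
    w = [0.0] * len(cum_losses)
    w[ftl_choice(cum_losses)] = 1.0
    return w
```

For example, with cumulative losses (1.0, 0.5, 2.0) the leader is expert 2 (index 1), and the strategy puts all its mass there.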
59 Failure of FTL [Figure, built up over several slides: a table of per-round expert losses on which FTL fails. Two experts alternate between loss 0 and loss 1 in such a way that in every round the current leader is exactly the expert about to suffer a loss; FTL therefore loses in almost every round, while each expert individually loses only about half the time.] As a result, L̂_n ≈ n while L_n ≈ n/2, so R_n ≈ n/2: the regret grows linearly with n. The learner must hedge its bets on the experts! 28 / 53
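The failure can be reproduced numerically. This is a minimal sketch using the standard adversarial construction as an assumption: first-round losses (0.5, 0) to break the initial tie, then alternating (0, 1) and (1, 0), so that FTL always follows the expert about to lose.

```python
def ftl_regret(n_rounds):
    """Two experts with adversarial alternating losses; FTL follows the loser."""
    cum = [0.0, 0.0]       # cumulative losses of the two experts
    ftl_loss = 0.0
    for i in range(n_rounds):
        if i == 0:
            losses = [0.5, 0.0]   # first round: break the tie the wrong way
        elif i % 2 == 1:
            losses = [0.0, 1.0]
        else:
            losses = [1.0, 0.0]
        leader = min(range(2), key=lambda k: cum[k])  # FTL's pick (ties: expert 0)
        ftl_loss += losses[leader]                    # FTL suffers the leader's loss
        cum = [c + l for c, l in zip(cum, losses)]
    return ftl_loss, min(cum)

lhat, lstar = ftl_regret(100)
```

Over 100 rounds FTL accumulates loss 99.5 while the best expert accumulates 49.5, so the regret is 50 ≈ n/2, exactly the linear behavior on the slide.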
83 Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]. Algorithm: each time expert k receives a loss l(k), multiply the weight w_k associated with that expert by e^{-η l(k)}, where η > 0: w_{i+1,k} = w_{i,k} e^{-η l_i(k)} / Z_i, where Z_i = Σ_{k=1}^N w_{i,k} e^{-η l_i(k)}. Unwinding this update: w_{i+1,k} = e^{-η L_i(k)} / Z_i, where L_i(k) = Σ_{j≤i} l_j(k) and Z_i = Σ_{k=1}^N e^{-η L_i(k)}. 29 / 53
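The multiplicative update takes a few lines (a sketch with invented names; `losses` holds the per-round expert losses l_i(k)):

```python
import math

def hedge_update(weights, losses, eta):
    """Multiply each expert's weight by exp(-eta * loss), then renormalize."""
    new = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(new)  # the normalization constant Z_i
    return [w / z for w in new]

# four experts, uniform prior, one round of losses, eta = 2
w = hedge_update([0.25] * 4, [0.3, 0.1, 0.2, 0.6], eta=2.0)
```

After the update the weights still sum to one, and the ordering is reversed relative to the losses: the expert with the smallest loss (0.1) gets the largest weight, the one with the largest loss (0.6) the smallest.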
86 Hedge as Bayesian updates. Prior probability over N alternatives E_1, ..., E_N; data likelihoods P(D_i | E_k), k = 1, ..., N: P(E_k | D_i) = P(D_i | E_k) P(E_k) / Σ_{j=1}^N P(D_i | E_j) P(E_j). The analogy: the posterior probability is w_{i+1,k}, the data likelihood is e^{-η l_i(k)}, the prior probability is w_{i,k}, and the normalization is Z_i. 30 / 53
88 Hedge example (η = 2) [Figure, built up over several slides: bar charts of the expert weights before and after each day's update.] On each day the four experts announce rain probabilities and the learner predicts with their weighted average: day 1: experts 30%, 10%, 20%, 60%, learner 30%; day 2: experts 50%, 80%, 70%, 30%, learner 64%; day 3: experts 50%, 50%, 50%, 50%, learner 50%; day 4: experts 10%, 10%, 30%, 80%, learner 21%. 31 / 53
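The slide's aggregated predictions can be reproduced numerically. The actual weather outcomes did not survive transcription, so the sequence below (sunny = 0 on day 1, rainy = 1 on day 2, anything on day 3 since all experts predict 50%) is an assumption chosen to match the displayed numbers; with it, Hedge with η = 2 and absolute loss produces 30%, ≈64%, 50%, and ≈21% (20.5% before rounding).

```python
import math

def hedge_predict(w, x):
    # learner's prediction: weighted average of the experts' predictions
    return sum(wk * xk for wk, xk in zip(w, x))

def hedge_update(w, losses, eta):
    # multiply each weight by exp(-eta * loss), then renormalize
    new = [wk * math.exp(-eta * l) for wk, l in zip(w, losses)]
    z = sum(new)
    return [wk / z for wk in new]

eta = 2.0
# rows: the four experts' rain probabilities on days 1-4 (from the slide)
experts = [[0.30, 0.50, 0.50, 0.10],
           [0.10, 0.80, 0.50, 0.10],
           [0.20, 0.70, 0.50, 0.30],
           [0.60, 0.30, 0.50, 0.80]]
outcomes = [0, 1, 0]  # ASSUMED weather: sunny, rainy, then irrelevant (all 50%)
w = [0.25] * 4        # uniform prior over experts
preds = []
for day in range(4):
    x = [e[day] for e in experts]
    preds.append(hedge_predict(w, x))
    if day < len(outcomes):
        w = hedge_update(w, [abs(outcomes[day] - xk) for xk in x], eta)
```

The first prediction is the plain average 30% because the weights are still uniform; afterwards the weights tilt toward the experts that matched the weather.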
100 Hedge analysis. Regret bound: for any data sequence, when η = sqrt(8 log N / n), R_n ≤ sqrt(n log N / 2). Regret bound: for any data sequence, when η = sqrt(2 log N / L_n), R_n ≤ sqrt(2 L_n log N) + log N. Both bounds are tight. 32 / 53
101 Statistical learning theory vs. online learning theory. Statistical learning theory: if W is finite, then for a classifier w_S trained by empirical risk minimization, E_{S~P} L(w_S) - min_{w in W} L(w) = O(sqrt(log|W| / n)). Online learning theory: in the expert setting (|W| = N), for the Hedge algorithm, (1/n) R_n = (1/n) L̂_n - (1/n) L_n = O(sqrt(log|W| / n)). Essentially the same performance without the i.i.d. assumption, but at the price of averaging! 33 / 53
103 Prediction with expert advice: extensions. Large (or countably infinite) classes of experts. Concept drift: competing with the best sequence of experts. Competing with the best small set of recurring experts. Ranking: competing with the best permutation. Partial feedback: multi-armed bandits... 34 / 53
104 Concept drift. Competing with the best sequence of m experts. Idea: treat each sequence of m experts as a new expert and run Hedge on top. The resulting update is w_{i+1,k} = α (1/N) + (1 - α) w_{i,k} e^{-η l_i(k)} / Σ_{j=1}^N w_{i,j} e^{-η l_i(j)}, a mixture of the standard Hedge weights and the initial distribution. Parameter α = m/n (the frequency of changes). Bayesian interpretation: prior and posterior over sequences of experts. Competing with the best small set of m recurring experts: a mixture over all past Hedge weights. 35 / 53
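The drift-robust update on this slide can be sketched as a Hedge step followed by mixing with the uniform initial distribution (a sketch; the function name is mine, and this is the Fixed-Share-style update the formula describes):

```python
import math

def fixed_share_update(w, losses, eta, alpha):
    """Hedge step, then mix with the uniform distribution with weight alpha."""
    n = len(w)
    hedged = [wk * math.exp(-eta * l) for wk, l in zip(w, losses)]
    z = sum(hedged)
    return [alpha / n + (1 - alpha) * h / z for h in hedged]

# even an expert with weight 0 regains mass alpha/N, so it can "come back"
w = fixed_share_update([1.0, 0.0, 0.0], losses=[1.0, 0.0, 0.0],
                       eta=2.0, alpha=0.1)
```

The point of the mixing term is visible in the example: experts whose weight has collapsed to zero are revived with weight α/N each round, which is what allows the algorithm to track a changing best expert.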
106 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 36 / 53
107 Classification/regression framework. Instances x: feature vectors. Outcome y: class/real output. Strategy w in W: a parameter vector w = (w_1, ..., w_d) in R^d. W can be R^d or a regularization ball W = {w : ||w||_p ≤ B}. The prediction is linear: ŷ = w·x. The loss l depends on the task we want to solve, but we assume l(y, ŷ) is convex in ŷ. 37 / 53
108 Examples. Linear regression: y in R and l(y, ŷ) = (y - ŷ)^2. Logistic regression: y in {-1, 1} and l(y, ŷ) = log(1 + e^{-yŷ}). Support vector machines: y in {-1, 1} and l(y, ŷ) = (1 - yŷ)_+. 38 / 53
109 Examples [Figure: the hinge (SVM), squared, and logistic losses as functions of the prediction ŷ. The logistic and hinge losses are plotted for y = 1; the squared error loss is plotted for y = ... .] 39 / 53
110 Gradient descent method. Minimize a function f(w) over w in R^d. Gradient descent method: w_{i+1} = w_i - η_i ∇_{w_i} f(w_i), where η_i is a step size. If we have a set of constraints w in W, after each step we need to project back onto W: w_{i+1} := arg min_{w in W} ||w_{i+1} - w||_2. Source: larry/ 40 / 53
112 Online (stochastic) gradient descent. Algorithm: start with any initial vector w_1 in W. For i = 1, 2, ...: (1) observe the input vector x_i; (2) predict with ŷ_i = w_i·x_i; (3) the outcome y_i is revealed; (4) suffer the loss l_i(w_i) = l(y_i, ŷ_i); (5) push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i); (6) if w_{i+1} is not in W, project it back onto W: w_{i+1} := arg min_{w in W} ||w_{i+1} - w||_2. 41 / 53
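The six steps can be sketched end to end. This is an illustrative sketch, not the lecture's code: it uses the squared loss, takes W = R^d so the projection is the identity, and the step-size schedule is an assumption.

```python
def ogd(data, grad, project, eta):
    """Online gradient descent: predict, suffer loss, step along the
    negative gradient, then project back onto the feasible set W."""
    d = len(data[0][0])
    w = [0.0] * d                                     # initial vector w_1
    cum_loss = 0.0
    for i, (x, y) in enumerate(data, start=1):
        y_hat = sum(wk * xk for wk, xk in zip(w, x))  # linear prediction w.x
        cum_loss += (y - y_hat) ** 2                  # squared loss example
        g = grad(w, x, y)
        w = project([wk - eta(i) * gk for wk, gk in zip(w, g)])
    return w, cum_loss

def sq_grad(w, x, y):
    # gradient of (y - w.x)^2 with respect to w: -2 (y - w.x) x
    r = y - sum(wk * xk for wk, xk in zip(w, x))
    return [-2.0 * r * xk for xk in x]

# a simple repeating stream with a perfect linear predictor w = (1, -1)
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)] * 50
w, _ = ogd(data, sq_grad, project=lambda v: v, eta=lambda i: 0.5 / i ** 0.5)
```

On this stream the iterates converge to the loss-minimizing vector (1, -1), illustrating that a decaying step size η_i ∝ 1/sqrt(i) is enough for the online iterates to settle.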
113 Online vs. standard gradient descent. The function we want to minimize: f(w) = Σ_{j=1}^n l_j(w). Standard (batch) GD: w_{i+1} := w_i - η_i ∇_{w_i} f(w_i) = w_i - η_i Σ_j ∇_{w_i} l_j(w_i); O(n) per iteration, needs to see all the data. Online GD: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i); O(1) per iteration, needs to see a single data point. 42 / 53
114 Online (stochastic) gradient descent [Figure, built up over several slides: the trajectory of online gradient descent within the feasible set W. Starting from w_1, each step -η ∇_{w_i} l_i(w_i) produces the next iterate w_2, w_3, ...; when a step leaves W, the iterate (here w_4) is projected back onto W.] 43 / 53
122 Calculating the gradient. The gradient ∇_{w_i} l_i(w_i) can be obtained by applying the chain rule: ∂l_i(w)/∂w_k = ∂l(y_i, w·x_i)/∂w_k = [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i} · ∂(w·x_i)/∂w_k = [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i} · x_{ik}. If we denote l'_i(w) := [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i}, then we can write ∇_{w_i} l_i(w_i) = l'_i(w_i) x_i. 44 / 53
123 Update rules for specific losses. Linear regression: l(y, ŷ) = (y - ŷ)^2, l'(w) = -2(y - ŷ); update: w_{i+1} = w_i + 2η_i (y_i - ŷ_i) x_i. Logistic regression: l(y, ŷ) = log(1 + e^{-yŷ}), l'(w) = -y / (1 + e^{yŷ}); update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}). Support vector machines: l(y, ŷ) = (1 - yŷ)_+, l'(w) = 0 if yŷ > 1, and -y if yŷ ≤ 1; update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i, i.e. the perceptron! 45 / 53
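The three derivatives translate into one-line update rules sharing the same chain-rule skeleton (a sketch with invented names; y is in {-1, +1} for the logistic and hinge cases):

```python
import math

def lprime_squared(y, y_hat):
    # derivative of (y - yhat)^2 with respect to yhat
    return -2.0 * (y - y_hat)

def lprime_logistic(y, y_hat):
    # derivative of log(1 + exp(-y * yhat)); y in {-1, +1}
    return -y / (1.0 + math.exp(y * y_hat))

def lprime_hinge(y, y_hat):
    # subgradient of (1 - y * yhat)_+; y in {-1, +1}
    return -y if y * y_hat <= 1 else 0.0

def ogd_step(w, x, y, eta, lprime):
    """One OGD step: w <- w - eta * l'(y, w.x) * x (gradient via chain rule)."""
    y_hat = sum(wk * xk for wk, xk in zip(w, x))
    g = lprime(y, y_hat)
    return [wk - eta * g * xk for wk, xk in zip(w, x)]
```

With the hinge derivative, a margin violation (y ŷ ≤ 1) triggers the step w + η y x while everything else leaves w untouched, which is exactly the perceptron behavior noted on the slide.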
127 Projection. w_i := arg min_{w in W} ||w_i - w||_2. When W = R^d: no projection step. When W = {w : ||w|| ≤ B} is an L2 ball, the projection corresponds to a renormalization of the weight vector: if ||w_i|| > B then w_i := B w_i / ||w_i||. Equivalent to L2 regularization. When W = {w : Σ_{k=1}^d |w_k| ≤ B} is an L1 ball, the projection corresponds to an additive shift of the absolute values of the weights, clipping the smaller weights to 0. Equivalent to L1 regularization; results in sparse solutions. 46 / 53
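Both projections are short. This is a sketch: the L2 case is the renormalization from the slide, while the L1 case follows the standard sort-and-threshold algorithm (Duchi et al., 2008), which goes beyond what the slide spells out.

```python
def project_l2(v, B):
    """Project onto the L2 ball of radius B: rescale if the norm exceeds B."""
    norm = sum(x * x for x in v) ** 0.5
    return list(v) if norm <= B else [B * x / norm for x in v]

def project_l1(v, B):
    """Project onto the L1 ball of radius B: shift absolute values down by a
    common threshold theta and clip at zero (small weights become exactly 0)."""
    if sum(abs(x) for x in v) <= B:
        return list(v)
    u = sorted((abs(x) for x in v), reverse=True)
    cum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cum += uj
        if uj - (cum - B) / j > 0:      # j is still a feasible support size
            theta = (cum - B) / j       # threshold for the largest such j
    return [max(abs(x) - theta, 0.0) * (1.0 if x >= 0 else -1.0) for x in v]
```

For example, projecting (3, 1) onto the L1 ball of radius 2 gives (2, 0): the second coordinate is clipped to exactly zero, which is the sparsity effect the slide mentions; the L2 projection of the same point would merely shrink both coordinates.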
131 L1 vs. L2 projection [Figure: the projection w_i := arg min_{w in W} ||w_i - w||_2 illustrated in the (w_1, w_2) plane for an L2 ball and for an L1 ball, with the projected w_i shown in each case.] 47 / 53
133 Convergence of online gradient descent. Theorem: let 0 be in W. Assume ||∇_w l_i(w)|| ≤ L and let W = max_{w in W} ||w||. Then, with w_1 = 0 and η_i = W / (L sqrt(i)), the regret is bounded by R_n ≤ W L sqrt(2n), so that the per-iteration regret satisfies (1/n) R_n ≤ W L sqrt(2/n). 48 / 53
134 Exponentiated Gradient [Kivinen & Warmuth, 1996]. Gradient descent: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i). Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{-η_i ∇_{w_i} l_i(w_i)} (componentwise). A direct extension of Hedge to the classification/regression framework. Requires positive weights, but can be applied in a general setting using a doubling trick. Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant. Works very well when the best model is sparse, but does not keep a sparse solution itself! 49 / 53
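The componentwise multiplicative update can be sketched directly (names invented; `grad` stands for the gradient vector ∇_{w_i} l_i(w_i), and the weights are kept on the probability simplex):

```python
import math

def eg_update(w, grad, eta):
    """Exponentiated gradient: multiply each weight by exp(-eta * gradient
    component), then renormalize so the weights stay a probability vector."""
    new = [wk * math.exp(-eta * g) for wk, g in zip(w, grad)]
    z = sum(new)
    return [wk / z for wk in new]

# a gradient that penalizes only the first coordinate shifts mass away from it
w = eg_update([0.25, 0.25, 0.25, 0.25], grad=[1.0, 0.0, 0.0, 0.0], eta=1.0)
```

Unlike the additive gradient-descent step, this update can never make a weight negative, and weights shrink or grow multiplicatively; this is what makes EG effective when only a few coordinates matter.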
135 Convergence of exponentiated gradient. Theorem: let W be the positive orthant of an L1 ball with radius W, and assume ||∇_w l_i(w)|| ≤ L. Then, with w_1 = (1/d, ..., 1/d) and η_i = (1/(W L)) sqrt(8 log d / i), the regret is bounded by R_n ≤ W L sqrt(n log d / 2), so that the per-iteration regret satisfies (1/n) R_n ≤ W L sqrt(log d / (2n)). 50 / 53
136 Extensions. Concept drift: competing with drifting parameter vectors. Partial feedback: contextual multi-armed bandit problems. Improvements for some (strongly convex, exp-concave) loss functions. Infinite-dimensional feature spaces via the kernel trick. Learning matrix parameters (matrix-norm regularization, positive definiteness, permutation matrices)... 51 / 53
137 Outline 1 Example: Online (Stochastic) Gradient Descent 2 Statistical learning theory 3 Online learning 4 Algorithms for prediction with expert advice 5 Algorithms for classification and regression 6 Conclusions 52 / 53
138 Conclusions. A theoretical framework for learning without the i.i.d. assumption. Performance bounds are often simpler to prove than in the stochastic setting. Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc. Yields online algorithms directly applicable to large-scale learning problems. Most currently used offline learning algorithms employ online learning as an optimization routine. 53 / 53
LargeScale Machine Learning with Stochastic Gradient Descent Léon Bottou NEC Labs America, Princeton NJ 08542, USA leon@bottou.org Abstract. During the last decade, the data sizes have grown faster than
More informationMACHINE LEARNING. Introduction. Alessandro Moschitti
MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:0016:00
More informationLecture 6: Logistic Regression
Lecture 6: CS 19410, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationAdvanced Topics in Machine Learning (Part II)
Advanced Topics in Machine Learning (Part II) 3. Convexity and Optimisation February 6, 2009 Andreas Argyriou 1 Today s Plan Convex sets and functions Types of convex programs Algorithms Convex learning
More informationCS229T/STAT231: Statistical Learning Theory (Winter 2015)
CS229T/STAT231: Statistical Learning Theory (Winter 2015) Percy Liang Last updated Wed Oct 14 2015 20:32 These lecture notes will be updated periodically as the course goes on. Please let us know if you
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationBig Data  Lecture 1 Optimization reminders
Big Data  Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data  Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationSparse Online Learning via Truncated Gradient
Sparse Online Learning via Truncated Gradient John Langford Yahoo! Research jl@yahooinc.com Lihong Li Department of Computer Science Rutgers University lihong@cs.rutgers.edu Tong Zhang Department of Statistics
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationOnline Classification on a Budget
Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationDiscrete Optimization
Discrete Optimization [Chen, Batson, Dang: Applied integer Programming] Chapter 3 and 4.14.3 by Johan Högdahl and Victoria Svedberg Seminar 2, 20150331 Todays presentation Chapter 3 Transforms using
More informationANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANIKALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
More informationSupport Vector Machine. Tutorial. (and Statistical Learning Theory)
Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@neclabs.com 1 Support Vector Machines: history SVMs introduced
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationL1 vs. L2 Regularization and feature selection.
L1 vs. L2 Regularization and feature selection. Paper by Andrew Ng (2004) Presentation by Afshin Rostami Main Topics Covering Numbers Definition Convergence Bounds L1 regularized logistic regression L1
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida  1 
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida  1  Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More informationAdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz
AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Presentation Outline: AdaBoost algorithm Why is of interest? How it works? Why
More informationLINEAR PROGRAMMING WITH ONLINE LEARNING
LINEAR PROGRAMMING WITH ONLINE LEARNING TATSIANA LEVINA, YURI LEVIN, JEFF MCGILL, AND MIKHAIL NEDIAK SCHOOL OF BUSINESS, QUEEN S UNIVERSITY, 143 UNION ST., KINGSTON, ON, K7L 3N6, CANADA EMAIL:{TLEVIN,YLEVIN,JMCGILL,MNEDIAK}@BUSINESS.QUEENSU.CA
More informationOnline Algorithms: Learning & Optimization with No Regret.
Online Algorithms: Learning & Optimization with No Regret. Daniel Golovin 1 The Setup Optimization: Model the problem (objective, constraints) Pick best decision from a feasible set. Learning: Model the
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationLCs for Binary Classification
Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it
More informationIntroduction to Convex Optimization for Machine Learning
Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning
More informationGambling and Data Compression
Gambling and Data Compression Gambling. Horse Race Definition The wealth relative S(X) = b(x)o(x) is the factor by which the gambler s wealth grows if horse X wins the race, where b(x) is the fraction
More information1 Portfolio Selection
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture # Scribe: Nadia Heninger April 8, 008 Portfolio Selection Last time we discussed our model of the stock market N stocks start on day with
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationMachine learning and optimization for massive data
Machine learning and optimization for massive data Francis Bach INRIA  Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Eric Moulines  IHES, May 2015 Big data revolution?
More informationBig learning: challenges and opportunities
Big learning: challenges and opportunities Francis Bach SIERRA Projectteam, INRIA  Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,
More informationIntroduction to Flocking {Stochastic Matrices}
Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur  Yvette May 21, 2012 CRAIG REYNOLDS  1987 BOIDS The Lion King CRAIG REYNOLDS
More informationSupport Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
More informationOnline SemiSupervised Learning
Online SemiSupervised Learning Andrew B. Goldberg, Ming Li, Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences University of Wisconsin Madison Xiaojin Zhu (Univ. WisconsinMadison) Online SemiSupervised
More informationIntroduction to Logistic Regression
OpenStaxCNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStaxCNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
More informationMultiArmed Bandit Problems
Machine Learning Coms4771 MultiArmed Bandit Problems Lecture 20 Multiarmed Bandit Problems The Setting: K arms (or actions) Each time t, each arm i pays off a bounded realvalued reward x i (t), say
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network?  Perceptron learners  Multilayer networks What is a Support
More informationMachine learning challenges for big data
Machine learning challenges for big data Francis Bach SIERRA Projectteam, INRIA  Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt  December 2012
More informationStatistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP  Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationSparse Online Learning via Truncated Gradient
Journal of Machine Learning Research 0 (2009) 77780 Submitted 6/08; Revised /08; Published 3/09 Sparse Online Learning via runcated Gradient John Langford Yahoo! Research New York, NY, USA Lihong Li Department
More informationVowpal Wabbit 7 Tutorial. Stephane Ross
Vowpal Wabbit 7 Tutorial Stephane Ross Binary Classification and Regression Input Format Data in text file (can be gziped), one example/line Label [weight] Namespace Feature... Feature Namespace... Namespace:
More informationOnline Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes
Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai ShalevShwartz Scribe: Shai ShalevShwartz In this lecture we describe a different model of learning which is called online
More informationRegression Using Support Vector Machines: Basic Foundations
Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering
More informationServer Load Prediction
Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / 2184189 CesaBianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationOnline Convex Optimization
E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationSibyl: a system for large scale machine learning
Sibyl: a system for large scale machine learning Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, Yoram Singer Machine Learning
More informationA Simple Introduction to Support Vector Machines
A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Largemargin linear
More information10601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
10601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html Course data All uptodate info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
More informationPrediction with Expert Advice for the Brier Game
Vladimir Vovk vovk@csrhulacuk Fedor Zhdanov fedor@csrhulacuk Computer Learning Research Centre, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK Abstract
More informationOn Adaboost and Optimal Betting Strategies
On Adaboost and Optimal Betting Strategies Pasquale Malacaria School of Electronic Engineering and Computer Science Queen Mary, University of London Email: pm@dcs.qmul.ac.uk Fabrizio Smeraldi School of
More informationLINEAR SYSTEMS. Consider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA  Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt  April 2013 Context Machine
More informationDUOL: A Double Updating Approach for Online Learning
: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University
More informationDistributed Machine Learning and Big Data
Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya
More informationCOMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16. Lecture 2: Linear Regression Gradient Descent Nonlinear basis functions
COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16 Lecture 2: Linear Regression Gradient Descent Nonlinear basis functions LINEAR REGRESSION MOTIVATION Why Linear Regression? Regression
More informationINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbookchapterslides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An
More informationSupport Vector Machines
Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric
More informationECEN 5682 Theory and Practice of Error Control Codes
ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.
More informationStochastic Optimization for Big Data Analytics: Algorithms and Libraries
Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yang SDM 2014, Philadelphia, Pennsylvania collaborators: Rong Jin, Shenghuo Zhu NEC Laboratories America, Michigan State
More informationNotes from Week 1: Algorithms for sequential prediction
CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 2226 Jan 2007 1 Introduction In this course we will be looking
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More information7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
More informationMachine Learning. Term 2012/2013 LSI  FIB. Javier Béjar cbea (LSI  FIB) Machine Learning Term 2012/2013 1 / 34
Machine Learning Javier Béjar cbea LSI  FIB Term 2012/2013 Javier Béjar cbea (LSI  FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationPoisson Models for Count Data
Chapter 4 Poisson Models for Count Data In this chapter we study loglinear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the
More informationPerceptron Learning Algorithm
Perceptron Learning Algorithm Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Separating Hyperplanes Construct linear decision boundaries that explicitly try to separate
More informationLecture 8 February 4
ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt
More information