Introduction to Online Learning Theory




Introduction to Online Learning Theory. Wojciech Kotłowski, Institute of Computing Science, Poznań University of Technology. IDSS, 04.06.2013

Outline: 1. Example: online (stochastic) gradient descent. 2. Statistical learning theory. 3. Online learning. 4. Algorithms for prediction with expert advice. 5. Algorithms for classification and regression. 6. Conclusions.


Text document classification [Bottou, 2007]. Data set: Reuters RCV1, a set of 800,000 documents from Reuters News published during 1996-1997; 781,265 training examples, 23,149 test examples; 47,152 TF-IDF features. Each document is assigned one or more topics (categories). For the sake of illustration, the problem is simplified to binary classification: predicting whether a document belongs to the category CCAT (Corporate/Industrial).

Document example:

<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="2330" id="root" date="1996-08-20" xml:lang="en">
  <title>USA: Tylan stock jumps; weighs sale of company.</title>
  <headline>Tylan stock jumps; weighs sale of company.</headline>
  <dateline>SAN DIEGO</dateline>
  <text>
    <p>The stock of Tylan General Inc. jumped Tuesday after the maker of
    process-management equipment said it is exploring the sale of the company
    and added that it has already received some inquiries from potential
    buyers.</p>
    <p>Tylan was up $2.50 to $12.75 in early trading on the Nasdaq market.</p>
    <p>The company said it has set up a committee of directors to oversee the
    sale and that Goldman, Sachs &amp; Co. has been retained as its financial
    adviser.</p>
  </text>
  <copyright>(c) Reuters Limited 1996</copyright>
  <metadata>
    <codes class="bip:countries:1.0">
      <code code="USA"> </code>
    </codes>
    <codes class="bip:industries:1.0">
      <code code="I34420"> </code>
    </codes>
    <codes class="bip:topics:1.0">
      <code code="C15"> </code> <code code="C152"> </code> <code code="C18"> </code>
      <code code="C181"> </code> <code code="CCAT"> </code>
    </codes>
    <dc element="dc.publisher" value="Reuters Holdings Plc"/>
    <dc element="dc.date.published" value="1996-08-20"/>
    <dc element="dc.source" value="Reuters"/>
  </metadata>
</newsitem>

Experiment. Two loss functions: logistic loss (logistic regression) and hinge loss (SVM). Two learning modes: standard (batch) learning, i.e. minimization of the (regularized) empirical risk on the training data, and online gradient descent (stochastic gradient descent). L2 regularization in both cases. Source: http://leon.bottou.org/projects/sgd

Results

Hinge loss:
method             comp. time   training error   test error
SVMLight (batch)   23,642 sec   0.2275           6.02%
SVMPerf (batch)    66 sec       0.2278           6.03%
SGD                1.4 sec      0.2275           6.02%

Logistic loss:
method             comp. time   training error   test error
LibLinear (batch)  30 sec       0.18907          5.68%
SGD                2.3 sec      0.18893          5.66%

Some more examples (hinge loss)

Data set      #objects    #features   time LIBSVM   time SGD
Reuters       781,000     47,000      2.5 days      7 sec
Translation   1,000,000   274,000     many days     7 sec
SuperTag      950,000     46,000      8 h           1 sec
Voicetone     579,000     88,000      10 h          1 sec


A typical scheme in machine learning. [Diagram: a learning algorithm is run on a training set S = {(x_1, y_1), ..., (x_4, y_4)} to produce a function (classifier) w_S : X → Y. On a test set T the classifier makes predictions ŷ_5 = w_S(x_5), ŷ_6 = w_S(x_6), ŷ_7 = w_S(x_7) for the unlabeled instances (x_5, ?), (x_6, ?), (x_7, ?), and its accuracy is measured by the losses l(y_5, ŷ_5), l(y_6, ŷ_6), l(y_7, ŷ_7) once the feedback y_5, y_6, y_7 arrives.]

Test set performance. The ultimate question of machine learning theory: given a training set S = {(x_i, y_i)}_{i=1}^n, how do we learn a function w_S : X → Y so that the total loss on a separate test set T = {(x_i, y_i)}_{i=n+1}^m,

∑_{i=n+1}^m l(y_i, w_S(x_i)),

is minimized? There is no reasonable answer without any assumptions ("no free lunch"): training data and test data must be, in some sense, similar.

Statistical learning theory. Assumption: training data and test data were independently generated from the same distribution P (i.i.d.). Mean training error of w (empirical risk): L_S(w) = (1/n) ∑_{i=1}^n l(y_i, w(x_i)). Test error = expected error of w (risk): L(w) = E_{(x,y)∼P}[l(y, w(x))]. Given training data S, how do we construct w_S to make L(w_S) as small as possible? ⇒ Empirical risk minimization.

Generalization bounds. Theorem for finite classes (0/1 loss) [Occam's razor]: let the class of functions W be finite. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:

E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(log|W| / n)).

Here E_{S∼P} L(w_S) is our test set performance, min_{w∈W} L(w) is the best test set performance in W, and the O(√(log|W| / n)) term is the overhead for learning on a finite training set.

Generalization bounds. Theorem for VC classes (0/1 loss) [Vapnik & Chervonenkis, 1971]: let W have VC dimension d_VC. If the classifier w_S was trained on S by minimizing the empirical risk within W, w_S = argmin_{w∈W} L_S(w), then:

E_{S∼P} L(w_S) = min_{w∈W} L(w) + O(√(d_VC / n)).

As before, the three terms are our test set performance, the best test set performance in W, and the overhead for learning on a finite training set.

Some remarks on statistical learning theory. The bounds are asymptotically tight. Typically d_VC ≈ d, the number of parameters, so that L(w_S) = min_{w∈W} L(w) + O(√(d/n)). Similar results (sometimes better ones) hold for losses other than 0/1. The bounds hold for any distribution P. Improvements: data-dependent bounds. The theory only tells you what happens for a given class W; choosing W is an art (domain knowledge).


Alternative view on learning. The stochastic (i.i.d.) assumption is often criticized, and sometimes clearly invalid (e.g. for time series). The learning process is by its very nature incremental. We do not observe the distributions; we only see the data. Motivation: can we relax the i.i.d. assumption and treat the data-generating process as completely arbitrary? Can we obtain performance guarantees based solely on observed quantities? We can! ⇒ online learning theory.

Example: weather prediction (rain/sunny). [Figure: in rounds i = 1, 2, 3, 4, ... the learner announces a probability of rain (50%, 25%, 10%, 25%, ...) and then observes the actual weather.]

Example: weather prediction (rain/sunny) with expert advice. In each round the learner sees the experts' forecasts before making its own prediction; after each round the actual weather is revealed:

             i = 1   i = 2   i = 3   i = 4   ...
expert 1     30%     50%     50%     10%     ...
expert 2     10%     80%     50%     10%     ...
expert 3     20%     70%     50%     30%     ...
expert 4     60%     30%     50%     80%     ...
learner      30%     65%     50%     10%     ...

Example: weather prediction (rain/sunny). Prediction accuracy is evaluated by a loss function, e.g. l(y_i, ŷ_i) = |y_i - ŷ_i|. Total performance is evaluated by the regret: the learner's cumulative loss minus the cumulative loss of the best expert in hindsight.

             i = 1   i = 2   i = 3   i = 4   cumulative loss
expert 1     30%     50%     50%     10%     0.3 + 0.5 + 0.5 + 0.1 = 1.4
expert 2     10%     80%     50%     10%     0.1 + 0.2 + 0.5 + 0.1 = 0.9
expert 3     20%     70%     50%     30%     0.2 + 0.3 + 0.5 + 0.3 = 1.3
expert 4     60%     30%     50%     80%     0.6 + 0.7 + 0.5 + 0.8 = 2.6
learner      30%     65%     50%     10%     0.3 + 0.45 + 0.5 + 0.1 = 1.35

Regret of the learner: 1.35 - 0.9 = 0.45. The goal is to have small regret for any data sequence.
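The regret computation above can be reproduced in a few lines of Python. A minimal sketch: the outcomes y are inferred here from the per-round losses in the table (rain in round 2, sun otherwise, coded as 1 and 0), and the learner's cumulative loss of 1.35 is taken directly from the table.

```python
# Absolute loss l(y, yhat) = |y - yhat|; predictions are chances of rain.
experts = [
    [0.30, 0.50, 0.50, 0.10],  # expert 1
    [0.10, 0.80, 0.50, 0.10],  # expert 2
    [0.20, 0.70, 0.50, 0.30],  # expert 3
    [0.60, 0.30, 0.50, 0.80],  # expert 4
]
y = [0, 1, 0, 0]  # outcomes inferred from the losses in the table

def cumulative_loss(preds):
    return sum(abs(yi - pi) for yi, pi in zip(y, preds))

expert_losses = [cumulative_loss(e) for e in experts]  # 1.4, 0.9, 1.3, 2.6
learner_loss = 1.35                                    # taken from the table
regret = learner_loss - min(expert_losses)             # 1.35 - 0.9 = 0.45
```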

Online learning framework. [Diagram: in round i the learner (strategy w_i : X → Y) receives a new instance (x_i, ?), predicts ŷ_i = w_i(x_i), receives the feedback y_i, suffers the loss l(y_i, ŷ_i), and moves on to round i + 1.]

Online learning framework. Set of strategies (actions) W; known loss function l. The learner starts with some initial strategy (action) w_1. For i = 1, 2, ...:
1. The learner observes an instance x_i.
2. The learner predicts with ŷ_i = w_i(x_i).
3. The environment reveals the outcome y_i.
4. The learner suffers the loss l_i(w_i) = l(y_i, ŷ_i).
5. The learner updates its strategy: w_i → w_{i+1}.
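The protocol above can be sketched as a generic loop. This is only an illustration: `run_online` and the toy "predict the running mean" instantiation are hypothetical names, not part of the original slides.

```python
def run_online(w, predict, update, loss, data):
    """Generic online protocol: observe, predict, suffer loss, update."""
    total = 0.0
    for x, y in data:            # 1. observe instance x_i
        yhat = predict(w, x)     # 2. predict yhat_i = w_i(x_i)
        total += loss(y, yhat)   # 3.-4. outcome revealed, loss suffered
        w = update(w, x, y)      # 5. update the strategy
    return total, w

# Toy instantiation: predict the running mean of past outcomes (0.5 before
# any data), under squared loss. The strategy is a (sum, count) pair.
data = [(None, 1.0), (None, 0.0), (None, 1.0)]
mean_predict = lambda w, x: w[0] / w[1] if w[1] else 0.5
mean_update = lambda w, x, y: (w[0] + y, w[1] + 1)
squared = lambda y, yhat: (y - yhat) ** 2
total, _ = run_online((0.0, 0), mean_predict, mean_update, squared, data)
```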

Online learning framework. The goal of the learner is to be close to the best w in hindsight. Cumulative loss of the learner: L̂_n = ∑_{i=1}^n l_i(w_i). Cumulative loss of the best strategy w in hindsight: L*_n = min_{w∈W} ∑_{i=1}^n l_i(w). Regret of the learner: R_n = L̂_n - L*_n. The goal is to minimize the regret over all possible data sequences.

Prediction with expert advice. N prediction strategies ("experts") to follow. Instance x = (x_1, ..., x_N): the vector of the experts' predictions. Strategy w = (w_1, ..., w_N): a probability vector (∑_{k=1}^N w_k = 1, w_k ≥ 0), representing the learner's beliefs about the experts, or the probability of following a given expert. Learner's prediction: ŷ = w(x) = w·x = ∑_{k=1}^N w_k x_k. Regret: competing with the best expert, R_n = L̂_n - L*_n, where L*_n = min_{k=1,...,N} L_n(k) and L_n(k) is the cumulative loss of the k-th expert.

Applications of the expert advice setting: combining advice from actual experts :-), combining N learning algorithms/classifiers, gambling (e.g. horse racing), portfolio selection, routing/shortest path, spam filtering/document classification, learning boolean formulae, ranking/ordering, ...


Follow the leader (FTL). At iteration i, choose the expert with the smallest loss on the past data:

k_min = argmin_{k=1,...,N} ∑_{j=1}^{i-1} l_j(k) = argmin_{k=1,...,N} L_{i-1}(k).

The corresponding strategy w: w_{k_min} = 1, w_k = 0 for k ≠ k_min.
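As a sketch, FTL is a one-line argmin over cumulative losses (assuming ties are broken toward the lowest index; the function name is mine):

```python
def follow_the_leader(cumulative_losses):
    """FTL: put all probability mass on the expert with the smallest
    cumulative past loss L_{i-1}(k); ties go to the lowest index."""
    k_min = min(range(len(cumulative_losses)),
                key=cumulative_losses.__getitem__)
    w = [0.0] * len(cumulative_losses)
    w[k_min] = 1.0
    return w
```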

Failure of FTL. Consider two experts whose losses alternate adversarially. FTL always follows the currently leading expert, and from round 2 onward it is wrong every time:

             1     2     3     4     5     6     7     cumulative loss
expert 1     75%   100%  100%  100%  100%  100%  100%  3.75
expert 2     25%   0%    0%    0%    0%    0%    0%    3.25
FTL learner  50%   0%    100%  0%    100%  0%    100%  6.5

Hence L̂_n ≈ n while L*_n ≈ n/2, so R_n ≈ n/2: the regret grows linearly. The learner must hedge its bets on the experts!

Hedge [Littlestone & Warmuth, 1994; Freund & Schapire, 1997]. Algorithm: each time expert k receives a loss l(k), multiply the weight w_k associated with that expert by e^{-η l(k)}, where η > 0:

w_{i+1,k} = w_{i,k} e^{-η l_i(k)} / Z_i,   where Z_i = ∑_{k=1}^N w_{i,k} e^{-η l_i(k)}.

Unwinding this update gives w_{i+1,k} = e^{-η L_i(k)} / Z_i, where L_i(k) = ∑_{j≤i} l_j(k) and the normalization is now Z_i = ∑_{k=1}^N e^{-η L_i(k)}.
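One Hedge step can be sketched directly from the update rule (the helper name `hedge_update` is mine). Starting from a uniform prior, running it round by round gives the same weights as the unwound form over cumulative losses:

```python
import math

def hedge_update(w, losses, eta):
    """One Hedge step: w_k <- w_k * exp(-eta * l(k)), then renormalize."""
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

# Two rounds from a uniform prior over three experts; cumulative losses
# after both rounds are [0.2, 1.1, 1.4].
w = [1/3, 1/3, 1/3]
for losses in ([0.0, 1.0, 0.5], [0.2, 0.1, 0.9]):
    w = hedge_update(w, losses, eta=2.0)
```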

Hedge as Bayesian updates. Prior probability over N alternatives E_1, ..., E_N; data likelihoods P(D_i | E_k), k = 1, ..., N:

P(E_k | D_i) = P(D_i | E_k) P(E_k) / ∑_{j=1}^N P(D_i | E_j) P(E_j).

The correspondence: the posterior probability plays the role of w_{i+1,k}, the data likelihood plays the role of e^{-η l_i(k)}, the prior probability plays the role of w_{i,k}, and the denominator is the normalization Z_i.

Hedge example (η = 2). [Figure: four rounds of Hedge with four experts. In each round the bars show the experts' predictions (round 1: 30%, 10%, 20%, 60%; round 2: 50%, 80%, 70%, 30%; round 3: 50%, 50%, 50%, 50%; round 4: 10%, 10%, 30%, 80%), the learner's weighted prediction (30%, 64%, 50%, 21%), and the exponential weight updates after each outcome is revealed.]

Hedge analysis. Regret bound: for any data sequence, with η = √(8 log N / n),

R_n ≤ √(n log N / 2).

Regret bound: for any data sequence, with η = √(2 log N / L*_n),

R_n ≤ √(2 L*_n log N) + log N.

Both bounds are tight.

Statistical learning theory vs. online learning theory. Statistical learning theory: if W is finite, then for a classifier w_S trained by empirical risk minimization, E_{S∼P} L(w_S) - min_{w∈W} L(w) = O(√(log|W| / n)). Online learning theory: in the expert setting (|W| = N), for the Hedge algorithm, (1/n) R_n = (1/n) L̂_n - (1/n) L*_n = O(√(log|W| / n)). Essentially the same performance without the i.i.d. assumption, but at the price of averaging!

Prediction with expert advice: extensions. Large (or countably infinite) classes of experts. Concept drift: competing with the best sequence of experts. Competing with the best small set of recurring experts. Ranking: competing with the best permutation. Partial feedback: multi-armed bandits. ...

Concept drift: competing with the best sequence of m experts. Idea: treat each sequence of m experts as a new expert and run Hedge on top. The resulting update:

w_{i+1,k} = α (1/N) + (1 - α) w_{i,k} e^{-η l_i(k)} / ∑_{j=1}^N w_{i,j} e^{-η l_i(j)},

a mixture of the standard Hedge weights and the initial distribution, with parameter α = m/n (the frequency of changes). Bayesian interpretation: prior and posterior over sequences of experts. Competing with the best small set of m recurring experts: mix over all past Hedge weights instead.
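The mixing update above (a Fixed-Share-style rule) can be sketched as a Hedge step followed by mixing with the uniform initial distribution; `fixed_share_update` is an illustrative name:

```python
import math

def fixed_share_update(w, losses, eta, alpha):
    """Hedge step, then mix with the uniform distribution over N experts.
    Every weight stays >= alpha/N, so a previously bad expert can recover
    quickly after a change point."""
    N = len(w)
    unnorm = [wk * math.exp(-eta * lk) for wk, lk in zip(w, losses)]
    Z = sum(unnorm)
    return [alpha / N + (1 - alpha) * u / Z for u in unnorm]

# One update with most mass on expert 1, which then suffers a large loss.
w = fixed_share_update([0.98, 0.01, 0.01], [1.0, 0.0, 0.5],
                       eta=2.0, alpha=0.1)
```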


Classification/regression framework. Instances x: feature vectors. Outcome y: class label/real output. Strategy w ∈ W: a parameter vector w = (w_1, ..., w_d) ∈ R^d. W can be R^d or a regularization ball W = {w : ‖w‖_p ≤ B}. The prediction is linear: ŷ = w·x. The loss l depends on the task we want to solve, but we assume l(y, ŷ) is convex in ŷ.

Examples. Linear regression: y ∈ R and l(y, ŷ) = (y - ŷ)². Logistic regression: y ∈ {-1, +1} and l(y, ŷ) = log(1 + e^{-yŷ}). Support vector machines: y ∈ {-1, +1} and l(y, ŷ) = (1 - yŷ)_+.

Examples. [Figure: the hinge (SVM), squared, and logistic losses as functions of the prediction ŷ. The logistic and hinge losses are plotted for y = 1; the squared error loss is plotted for y = 0.]

Gradient descent method. Minimize a function f(w) over w ∈ R^d. Gradient descent: w_{i+1} = w_i - η_i ∇_{w_i} f(w_i), where η_i is a step size. If we have a set of constraints w ∈ W, after each step we need to project back onto W: w_{i+1} := argmin_{w∈W} ‖w_{i+1} - w‖_2. Source: http://www-bcf.usc.edu/~larry/, http://takisword.files.wordpress.com

Online (stochastic) gradient descent. Algorithm: start with any initial vector w_1 ∈ W. For i = 1, 2, ...:
1. Observe the input vector x_i.
2. Predict with ŷ_i = w_i·x_i.
3. The outcome y_i is revealed.
4. Suffer the loss l_i(w_i) = l(y_i, ŷ_i).
5. Push the weight vector toward the negative gradient of the loss: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i).
6. If w_{i+1} ∉ W, project it back onto W: w_{i+1} := argmin_{w∈W} ‖w_{i+1} - w‖_2.
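One step of the algorithm for a linear predictor can be sketched as follows (the `ogd_step` helper and the squared-loss instantiation are illustrative, not from the slides):

```python
def ogd_step(w, x, y, eta, dloss, project=None):
    """One OGD step: predict yhat = w.x, move against the loss gradient
    (which for a linear predictor is l'(y, yhat) * x), then optionally
    project back onto the constraint set W."""
    yhat = sum(wk * xk for wk, xk in zip(w, x))
    g = dloss(y, yhat)  # scalar derivative l'(y, yhat)
    w = [wk - eta * g * xk for wk, xk in zip(w, x)]
    return project(w) if project is not None else w

# Squared loss: l(y, yhat) = (y - yhat)^2, so l'(y, yhat) = -2 (y - yhat).
squared = lambda y, yhat: -2.0 * (y - yhat)
w = ogd_step([0.0, 0.0], [1.0, 2.0], y=1.0, eta=0.1, dloss=squared)
```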

Online vs. standard gradient descent. The function we want to minimize: f(w) = ∑_{j=1}^n l_j(w). Standard (batch) GD: w_{i+1} := w_i - η_i ∇_{w_i} f(w_i) = w_i - η_i ∑_j ∇_{w_i} l_j(w_i); O(n) per iteration, needs to see all the data. Online GD: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i); O(1) per iteration, needs to see a single data point.

Online (stochastic) gradient descent. [Figure: iterates of OGD inside the constraint set W. Starting from w_1, each step moves against the gradient of the current loss (w_2 = w_1 - η ∇_{w_1} l_1(w_1), and so on); when a step leaves W, the iterate is projected back onto W (as for w_4).]

Calculating the gradient. The gradient ∇_{w_i} l_i(w_i) can be obtained by applying the chain rule:

∂l_i(w)/∂w_k = ∂l(y_i, w·x_i)/∂w_k = [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i} · ∂(w·x_i)/∂w_k = [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i} · x_{ik}.

If we denote l'_i(w) := [∂l(y_i, ŷ)/∂ŷ]_{ŷ=w·x_i}, then we can write ∇_{w_i} l_i(w_i) = l'_i(w_i) x_i.

Update rules for specific losses.
Linear regression: l(y, ŷ) = (y - ŷ)², so l'(ŷ) = -2(y - ŷ). Update: w_{i+1} = w_i + 2η_i (y_i - ŷ_i) x_i.
Logistic regression: l(y, ŷ) = log(1 + e^{-yŷ}), so l'(ŷ) = -y / (1 + e^{yŷ}). Update: w_{i+1} = w_i + η_i y_i x_i / (1 + e^{y_i ŷ_i}).
Support vector machines: l(y, ŷ) = (1 - yŷ)_+, so l'(ŷ) = 0 if yŷ > 1 and -y if yŷ ≤ 1. Update: w_{i+1} = w_i + η_i 1[y_i ŷ_i ≤ 1] y_i x_i — the perceptron!
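The three scalar derivatives l'(ŷ) can be written down directly (a sketch using the y ∈ {-1, +1} convention for the logistic and hinge losses; function names are mine):

```python
import math

# Scalar derivatives l'(yhat); the OGD update is w <- w - eta * l' * x.
def sq_grad(y, yhat):        # squared loss (y - yhat)^2
    return -2.0 * (y - yhat)

def logistic_grad(y, yhat):  # logistic loss log(1 + exp(-y * yhat))
    return -y / (1.0 + math.exp(y * yhat))

def hinge_grad(y, yhat):     # hinge loss max(0, 1 - y * yhat);
    return -y if y * yhat <= 1 else 0.0  # subgradient at the kink
```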

Projection: w_i := argmin_{w∈W} ‖w_i - w‖_2. When W = R^d: no projection step. When W = {w : ‖w‖ ≤ B} is an L2 ball, projection corresponds to renormalization of the weight vector: if ‖w_i‖ > B, then w_i := B w_i / ‖w_i‖. Equivalent to L2 regularization. When W = {w : ∑_{k=1}^d |w_k| ≤ B} is an L1 ball, projection corresponds to an additive shift of the absolute values, clipping the smaller weights to 0. Equivalent to L1 regularization; results in sparse solutions.
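Both projections can be sketched in a few lines; the L1 case uses a standard sort-and-threshold construction of the additive shift described above (helper names are mine):

```python
def project_l2(w, B):
    """Project onto the L2 ball of radius B: rescale if the norm exceeds B."""
    norm = sum(wk * wk for wk in w) ** 0.5
    return w if norm <= B else [B * wk / norm for wk in w]

def project_l1(w, B):
    """Project onto the L1 ball of radius B: shift all magnitudes down by a
    common theta and clip at 0 (producing sparsity), with theta chosen so
    the result has L1 norm exactly B."""
    if sum(abs(wk) for wk in w) <= B:
        return w
    u = sorted((abs(wk) for wk in w), reverse=True)
    cum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cum += uj
        if uj > (cum - B) / j:
            theta = (cum - B) / j
    return [max(abs(wk) - theta, 0.0) * (1.0 if wk >= 0 else -1.0)
            for wk in w]
```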

L1 vs. L2 projection: w_i := argmin_{w∈W} ‖w_i - w‖_2. [Figure: the same point w_i projected onto an L2 ball and onto an L1 ball. The L2 projection rescales w_i toward the origin, while the L1 projection lands on a face of the ball, setting some coordinates to exactly 0.]

Convergence of online gradient descent. Theorem: let 0 ∈ W, assume ‖∇_w l_i(w)‖ ≤ L, and let W = max_{w∈W} ‖w‖. Then, with w_1 = 0 and η_i = W / (L√i), the regret is bounded by

R_n ≤ W L √(2n),

so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(2/n).

Exponentiated gradient [Kivinen & Warmuth, 1996]. Gradient descent: w_{i+1} := w_i - η_i ∇_{w_i} l_i(w_i). Exponentiated descent: w_{i+1} := (1/Z_i) w_i e^{-η_i ∇_{w_i} l_i(w_i)} (componentwise, with normalization Z_i). A direct extension of Hedge to the classification/regression framework. Requires positive weights, but can be applied in the general setting using the doubling trick. Works much better than online gradient descent when d is very large (many features) and only a small number of features is relevant. Works very well when the best model is sparse, but does not keep a sparse solution itself!
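The componentwise update can be sketched as follows; it is identical in form to Hedge, but fed with gradient coordinates instead of expert losses (the `eg_update` name is mine):

```python
import math

def eg_update(w, grad, eta):
    """Exponentiated gradient step: w_k <- w_k * exp(-eta * g_k) / Z_i."""
    unnorm = [wk * math.exp(-eta * gk) for wk, gk in zip(w, grad)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

# Coordinates with a large positive gradient are damped multiplicatively,
# so irrelevant features decay geometrically while the weights stay on the
# probability simplex.
w = eg_update([0.25, 0.25, 0.25, 0.25], [1.0, 1.0, 1.0, -1.0], eta=0.5)
```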

Convergence of exponentiated gradient. Theorem: let W be the positive orthant of the L1 ball with radius W, and assume ‖∇_w l_i(w)‖_∞ ≤ L. Then, with w_1 = (1/d, ..., 1/d) and η_i = (1/(W L)) √(8 log d / i), the regret is bounded by

R_n ≤ W L √(n log d / 2),

so that the per-iteration regret satisfies (1/n) R_n ≤ W L √(log d / (2n)).

Extensions. Concept drift: competing with drifting parameter vectors. Partial feedback: contextual multi-armed bandit problems. Improvements for some (strongly convex, exp-concave) loss functions. Infinite-dimensional feature spaces via the kernel trick. Learning matrix parameters (matrix-norm regularization, positive definiteness, permutation matrices). ...


Conclusions. A theoretical framework for learning without the i.i.d. assumption. Performance bounds are often simpler to prove than in the stochastic setting. Easy to generalize to changing environments (concept drift), more general actions (reinforcement learning), partial information (bandits), etc. Results in online algorithms directly applicable to large-scale learning problems. Most currently used offline learning algorithms employ online learning as an optimization routine.