Simple and efficient online algorithms for real world applications




Simple and efficient online algorithms for real world applications. Università degli Studi di Milano, Milano, Italy. Talk @ Centro de Visión por Computador


Something about me: PhD in Robotics at LIRA-Lab, University of Genova; PostDoc in Machine Learning. I like theoretically motivated algorithms. But they must work well too! ;-)

Outline: 1. Online learning 2. OM-2, Projectron, BBQ


What do they have in common?
Spam filtering: minimize the number of wrongly classified mails, while training.
Shifting distributions: the Learner must track the best classifier.
Learning with limited feedback: selecting publicity banners.
Active learning: asking for specific samples to label.
Interactive learning: the agent and the human are learning at the same time.
These problems cannot be solved with standard batch learning!
But they can be in the online learning framework!

Learning in an artificial agent
We have the Learner and the Teacher. The Learner observes examples in a sequence of rounds and constructs the classification function incrementally. Given an input, the Learner predicts its label. Then the Teacher reveals the true label. The Learner compares its prediction with the true label and updates its knowledge. The aim of the Learner is to minimize the number of mistakes.

Machine learning point of view on online learning
The Teacher is a black box: it can choose the examples arbitrarily; in the worst case the choice can be adversarial!
There is no IID assumption on the data!
Useful to model the interaction between the user and the algorithm, or non-stationary data.
Performance is measured while training: there is no separate test set.
Update after each sample: the efficiency of the update is important.
See [Cesa-Bianchi & Lugosi, 2006] for a full introduction to the theory of online learning.

Formal setting
Learning goes on in a sequence of $T$ rounds.
Instances: $x_t \in \mathcal{X}$ (images, sounds, etc.).
Labels: $y_t \in \mathcal{Y}$ (labels, numbers, structured outputs, etc.).
Prediction rule: $f_t(x) = \hat{y}$; for binary classification $\hat{y} = \mathrm{sign}(\langle w, x \rangle)$.
Loss: $\ell(\hat{y}, y) \in \mathbb{R}_+$.


Regret bounds
The learner must minimize its loss on the $T$ observed samples, compared to the loss of the best fixed predictor:
$$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \leq \min_{f} \sum_{t=1}^{T} \ell\big(f(x_t), y_t\big) + R_T$$
We want the regret $R_T$ to be small: a small regret indicates that the performance of the learner is not too far from that of the best fixed classifier.

A loss for everyone
Mistake (0-1) loss: $\ell_{01}(w, x, y) := \mathbf{1}\big(y \neq \mathrm{sign}(\langle w, x \rangle)\big)$
Hinge loss: $\ell_{\mathrm{hinge}}(w, x, y) := \max\big(0, 1 - y \langle w, x \rangle\big)$
Logistic loss: $\ell_{\mathrm{logreg}}(w, x, y) := \log\big(1 + \exp(-y \langle w, x \rangle)\big)$
Exponential loss, etc.
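A minimal sketch of these three losses in plain Python/NumPy (function names are mine; $x$ and $w$ are assumed to be NumPy arrays, $y \in \{-1, +1\}$):

```python
import numpy as np

def zero_one_loss(w, x, y):
    """Mistake (0-1) loss: 1 if the sign of <w, x> disagrees with y."""
    return float(y != np.sign(np.dot(w, x)))

def hinge_loss(w, x, y):
    """Hinge loss: max(0, 1 - y <w, x>)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def logistic_loss(w, x, y):
    """Logistic loss: log(1 + exp(-y <w, x>))."""
    return float(np.log1p(np.exp(-y * np.dot(w, x))))
```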

Let's start from the Perceptron
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t = \mathrm{sign}(\langle w, x_t \rangle)$
  Receive label $y_t$
  if $y_t \neq \hat{y}_t$ then $w \leftarrow w + y_t x_t$
end for
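For concreteness, a minimal NumPy rendering of this loop (the data stream and the dimension d are placeholders I introduce for the sketch):

```python
import numpy as np

def perceptron(stream, d):
    """Online Perceptron: update w only on prediction mistakes.

    stream yields (x, y) pairs with x a length-d NumPy array and y in {-1, +1}."""
    w = np.zeros(d)
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if np.dot(w, x) >= 0 else -1   # predict
        if y_hat != y:                           # mistake: update
            w += y * x
            mistakes += 1
    return w, mistakes
```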


The mistake bound of the Perceptron
Let $(x_1, y_1), \ldots, (x_T, y_T)$ be any sequence of instance-label pairs, with $y_t \in \{-1, +1\}$ and $\|x_t\| \leq R$. The number of mistakes of the Perceptron is bounded by
$$\min_{u} \Big( L_u + \underbrace{\|u\|^2 R^2 + \|u\|\, R\, \sqrt{L_u}}_{R_T} \Big), \qquad \text{where } L_u = \sum_{i=1}^{T} \ell_{\mathrm{hinge}}(u, x_i, y_i).$$
If the problem is linearly separable, the maximum number of mistakes is $R^2 \|u\|^2$, regardless of the ordering of the samples!
[Video]


Pros and Cons of the Perceptron
Pros :-) Very efficient! It can be easily trained with huge datasets. Theoretical guarantee on the maximum number of mistakes.
Cons :-( Linear hyperplanes only. Binary classification only.
Let's generalize the Perceptron!

Non-linear classifiers using kernels!
Suppose we transform the inputs through a non-linear transform $\phi(x)$ into a feature space. A linear classifier in the feature space results in a non-linear classifier in the input space. We can do even better: the algorithms only need access to inner products $\langle \phi(x_1), \phi(x_2) \rangle$, so if we have a function $K(x_1, x_2)$ that computes them, we do not need $\phi(\cdot)$ at all!

Kernel Perceptron
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t = \mathrm{sign}\big(\sum_{x_i \in S} y_i K(x_i, x_t)\big)$
  Receive label $y_t$
  if $y_t \neq \hat{y}_t$ then add $x_t$ to the support set $S$
end for
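A compact sketch of this kernelized loop. The Gaussian kernel and its bandwidth are my placeholder choices; any kernel function of two inputs works:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def kernel_perceptron(stream, kernel=gaussian_kernel):
    """Kernel Perceptron: store mistaken examples in a support set S."""
    S, labels = [], []
    for x, y in stream:
        score = sum(yi * kernel(xi, x) for xi, yi in zip(S, labels))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:            # mistake: add x to the support set
            S.append(x)
            labels.append(y)
    return S, labels
```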

Follow The Regularized Leader
The Perceptron is just a particular case of a more general algorithm: the Follow The Regularized Leader (FTRL) algorithm.
In FTRL, at each time step we predict with the approximate batch solution, using a linearization of the losses instead of the true loss functions.
Regret bounds come almost for free!
See [Kakade et al., 2009] for FTRL and [Orabona & Crammer, 2010] for an even more general algorithm.


The general algorithm
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t$
  Receive label $y_t$
  $\theta \leftarrow \theta - \eta_t\, \partial \ell(w, x_t, y_t)$
  $w \leftarrow \nabla g_t(\theta)$
end for
Suppose $g_t(\theta) = \frac{1}{2}\|\theta\|^2$, so $\nabla g_t(\theta) = \theta$. Use the hinge loss, whose subgradient on a mistake is $\partial \ell(w, x_t, y_t) = -y_t x_t$, and set $\eta_t = 1$ on mistakes and $0$ otherwise: we have recovered the Perceptron algorithm!
F. Orabona and K. Crammer. New Adaptive Algorithms for Online Classification. In Neural Information Processing Systems (NIPS), 2010.
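A short sketch of this generic $\theta$/$w$ scheme in NumPy, instantiated exactly as on the slide so that it reduces to the Perceptron. The function and parameter names are my own; the stream is assumed to yield NumPy arrays:

```python
import numpy as np

def online_dual_scheme(stream, d, grad_loss, link, learning_rate):
    """Generic scheme: the dual vector theta accumulates scaled loss gradients,
    the link function maps theta back to the primal weights w."""
    theta = np.zeros(d)
    w = link(theta)
    for x, y in stream:
        y_hat = np.sign(np.dot(w, x))          # predict
        eta = learning_rate(w, x, y)           # step size for this round
        theta -= eta * grad_loss(w, x, y)      # dual update
        w = link(theta)                        # back to the primal
    return w

# Perceptron instantiation: g(theta) = 0.5*||theta||^2  =>  link = identity,
# hinge subgradient on a mistake is -y*x, and eta = 1 only on mistakes.
perceptron_w = lambda stream, d: online_dual_scheme(
    stream, d,
    grad_loss=lambda w, x, y: -y * x,
    link=lambda theta: theta,
    learning_rate=lambda w, x, y: 1.0 if y * np.dot(w, x) <= 0 else 0.0,
)
```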

The Recipe
Create the online algorithm that best suits your needs!
Define a task: this will define a (convex) loss function.
Define which characteristics you would like your solution to have: this will define a (strongly convex) regularizer.
Compute the gradient of the loss.
Compute the gradient of the Fenchel dual of the regularizer.
Just code it!

Some examples
Multiclass-multilabel classification with $M$ classes and $M$ different classifiers $w_i$:
$$\ell(\bar{w}, x, Y) = \max\Big(1 + \max_{y \notin Y} \langle w_y, x \rangle - \min_{y \in Y} \langle w_y, x \rangle,\; 0\Big)$$
Do you prefer sparse classifiers? Use the regularizer $\|w\|_p^2$ with $1 < p \leq 2$.
$K$ different kernels? Use the regularizer $\big\|[\|w_1\|_2, \ldots, \|w_K\|_2]\big\|_p^2$ with $1 < p \leq 2$.
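A small Python rendering of that multiclass-multilabel hinge loss; the matrix layout and names are my own choices:

```python
import numpy as np

def multilabel_hinge(W, x, relevant):
    """Loss from the slide: 1 + best irrelevant score - worst relevant score, clipped at 0.

    W: (M, d) array, one classifier per row; x: (d,) input;
    relevant: set of indices of the relevant labels Y."""
    scores = W @ x
    rel = [scores[i] for i in range(len(scores)) if i in relevant]
    irr = [scores[i] for i in range(len(scores)) if i not in relevant]
    if not rel or not irr:        # degenerate case: nothing to compare against
        return 0.0
    return max(0.0, 1.0 + max(irr) - min(rel))
```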


Batch Solutions with Online Learning
What most people do when they want a batch solution: they stop the online algorithm and use the last solution found. This is wrong: the last solution can be arbitrarily bad!
If the data are IID you can simply use the averaged solution [Cesa-Bianchi et al., 2004].
Alternative way: run several epochs of training, until convergence.
[Video]
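To make the averaging concrete, here is a minimal sketch of a Perceptron that also returns the average of its iterates, which is the predictor to use in the IID/batch setting (a sketch of the idea, not the exact procedure analyzed in [Cesa-Bianchi et al., 2004]):

```python
import numpy as np

def averaged_perceptron(stream, d):
    """Perceptron that also accumulates the average of all iterates w_t."""
    w = np.zeros(d)
    w_sum = np.zeros(d)
    n = 0
    for x, y in stream:
        if y * np.dot(w, x) <= 0:   # mistake: standard Perceptron update
            w += y * x
        w_sum += w                  # accumulate the iterate after this round
        n += 1
    w_avg = w_sum / max(n, 1)
    return w, w_avg                 # last solution vs. averaged solution
```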

Outline: 1. Online learning 2. OM-2, Projectron, BBQ


Multi Kernel Learning online?
[Diagram: F cues (Cue 1, ..., Cue F), each with its own weight $w_1, \ldots, w_F$, combined by a sum into the prediction.]
We have F different features, e.g. color, shape, etc.
Solution: the Online Multi-class Multi-kernel learning algorithm (OM-2).
L. Jie, F. Orabona, M. Fornoni, B. Caputo, and N. Cesa-Bianchi. OM-2: An Online Multi-class Multi-kernel Learning Algorithm. In Proc. of the 4th IEEE Online Learning for Computer Vision Workshop (at CVPR 2010).

OM-2: Pseudocode
Input: $q$
Initialize: $\theta_1 = 0$, $w_1 = 0$
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t = \arg\max_{y=1,\ldots,M} \langle w_t, \phi(x_t, y) \rangle$
  Receive label $y_t$
  $z_t = \phi(x_t, y_t) - \phi(x_t, \hat{y}_t)$
  if $\ell(w_t, x_t, y_t) > 0$ then $\eta_t = \min\Big\{ \dfrac{2\,(1 - \langle w_t, z_t \rangle)}{\|z_t\|_{2,q}^2},\; 1 \Big\}$ else $\eta_t = 0$
  $\theta_{t+1} = \theta_t + \eta_t z_t$
  $w_{t+1}^j = \dfrac{1}{q} \Big( \dfrac{\|\theta_{t+1}^j\|_2}{\|\theta_{t+1}\|_{2,q}} \Big)^{q-2} \theta_{t+1}^j, \quad j = 1, \ldots, F$
end for
Code available!
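The only non-obvious step is the map from the dual vector $\theta$ (one block $\theta^j$ per cue/kernel) back to the primal weights. A small sketch of that map as reconstructed above; the $1/q$ scaling is taken from the slide and should be treated as an assumption, and $q > 2$ is assumed (as in the experiments below, where $p = 1.01$):

```python
import numpy as np

def group_link(theta_blocks, q):
    """Map dual blocks theta^j to primal weights via the (2,q) group norm:
    w^j = (1/q) * (||theta^j||_2 / ||theta||_{2,q})^(q-2) * theta^j."""
    norms = np.array([np.linalg.norm(th) for th in theta_blocks])
    total = np.sum(norms ** q) ** (1.0 / q)        # ||theta||_{2,q}
    if total == 0.0:
        return [np.zeros_like(th) for th in theta_blocks]
    out = []
    for nj, th in zip(norms, theta_blocks):
        if nj == 0.0:
            out.append(np.zeros_like(th))          # empty block stays zero (q > 2)
        else:
            out.append((1.0 / q) * (nj / total) ** (q - 2) * th)
    return out
```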

Experiments
We compared OM-2 to OMCL [Jie et al., ACCV09] and to PA-I [Crammer et al., JMLR06], using the best feature and the sum of the kernels. We also used SILP [Sonnenburg et al., JMLR06], a state-of-the-art batch MKL solver. We used Caltech-101 with 39 different kernels, as in [Gehler and Nowozin, ICCV09].

Caltech-101: online performance
[Figure: online average training error rate vs. number of training samples on Caltech-101 (5 splits, 30 training examples per category), comparing OM-2 (p = 1.01), OMCL, PA-I with the average kernel, and PA-I with the best kernel.]
Best results with p = 1.01. OM-2 achieves the best performance among the online algorithms.

Caltech-101: batch performance
[Figure: classification rate on test data vs. number of training epochs on Caltech-101 (5 splits, 30 training examples per category) for OM-2 (p = 1.01), OMCL, PA-I (average kernel), PA-I (best kernel), and SILP.]
The Matlab implementation of OM-2 takes 45 minutes, SILP more than 2 hours. The performance advantage of OM-2 over SILP is due to the fact that OM-2 is based on a native multiclass formulation.


Learning with bounded complexity
Perceptron-like algorithms will never stop updating the solution if the problem is not linearly separable.
On the other hand, if we use kernels, sooner or later the memory of the computer will run out...
Is it possible to bound the complexity of the learner?
Yes! Just update with a noisy version of the gradient: if the new vector can be well approximated by the old ones, just update the coefficients of the old ones.
F. Orabona, J. Keshet, and B. Caputo. Bounded Kernel-Based Online Learning. Journal of Machine Learning Research, 2009.

The Projectron Algorithm
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t = \mathrm{sign}(\langle w, x_t \rangle)$
  Receive label $y_t$
  if $y_t \neq \hat{y}_t$ then
    $w' = w + y_t x_t$ (full update)
    $w'' = w + y_t P(x_t)$ (update projected onto the span of the support set)
    if $\delta_t = \|w' - w''\| \leq \eta$ then $w = w''$ else $w = w'$
  end if
end for
It is possible to calculate the projection even when using kernels. The algorithm has a mistake bound and a bounded memory growth.
Code available!
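A simplified sketch of this idea with kernels, in the spirit of the Projectron rather than a faithful reimplementation of Projectron++: a Gaussian kernel is assumed, and the inverse kernel matrix on the support set is recomputed from scratch for clarity (the real algorithm updates it incrementally):

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def projectron_sketch(stream, eta=0.1, gamma=1.0):
    """Bounded kernel Perceptron: project the update onto the current support
    set when the projection error delta is at most eta."""
    S, alpha = [], []      # support vectors and their coefficients
    Kinv = None            # inverse kernel matrix on S
    for x, y in stream:
        k = np.array([gaussian_kernel(s, x, gamma) for s in S])
        score = float(np.dot(alpha, k)) if S else 0.0
        if (1 if score >= 0 else -1) != y:            # mistake
            if S:
                d = Kinv @ k                          # projection coefficients
                delta2 = gaussian_kernel(x, x, gamma) - float(k @ d)
            else:
                d, delta2 = None, gaussian_kernel(x, x, gamma)
            if S and delta2 <= eta ** 2:
                alpha = list(np.asarray(alpha) + y * d)   # projected update
            else:
                S.append(x)                           # grow the support set
                alpha.append(float(y))
                K = np.array([[gaussian_kernel(a, b, gamma) for b in S] for a in S])
                Kinv = np.linalg.inv(K + 1e-8 * np.eye(len(S)))
    return S, np.asarray(alpha)
```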

Average Online Error vs. Budget Size
[Figure: fraction of online mistakes vs. size of the support set on Adult9 (a9a: 32561 samples, 123 features, Gaussian kernel), comparing RBP, Forgetron, the plain Perceptron, PA-I, and Projectron++. Size of the full support set = 12510 (36.9% of the samples).]

Robot Navigation & Phoneme Recognition
Place recognition: IDOL2 database, 5 rooms, CRFH features.
Phoneme recognition: subset of the TIMIT corpus, 55 phonemes, MFCC + Δ + ΔΔ features.

Place recognition
[Figure: accuracy on the test set vs. training step on IDOL2 with an exp-χ² kernel, comparing the Perceptron with Projectron++ for three values of η (η₁, η₂, η₃).]

Place recognition
[Figure: size of the support set vs. training step on IDOL2 with the exp-χ² kernel, for the Perceptron and Projectron++ with η₁, η₂, η₃.]

Phoneme recognition
[Figure: fraction of online mistakes vs. size of the support set for phoneme recognition with a Gaussian kernel, Projectron++ vs. PA-I.]

Phoneme recognition
[Figure: error on the test set vs. size of the support set for phoneme recognition with a Gaussian kernel, Projectron++ vs. PA-I.]


Semi-supervised online learning
We are given N training samples for a regression problem, $\{x_i, y_i\}_{i=1}^N$, with $y_i \in [-1, 1]$.
Obtaining the output for a given sample can be expensive, so we want to ask for as few labels as possible.
We assume the outputs $y_t$ are realizations of random variables $Y_t$ such that $\mathbb{E}[Y_t] = u^\top x_t$ for all $t$, where $u \in \mathbb{R}^d$ is a fixed and unknown vector with $\|u\| = 1$.
The order of the data is still adversarial, but the outputs are generated by a stochastic source.
Solution: the Bound on Bias Query algorithm (BBQ).
N. Cesa-Bianchi, C. Gentile, and F. Orabona. Robust Bounds for Classification via Selective Sampling. In Proc. of the International Conference on Machine Learning (ICML), 2009.

The Parametric BBQ Algorithm
Parameters: $0 < \varepsilon, \delta < 1$
for t = 1, 2, ..., T do
  Receive new instance $x_t$
  Predict $\hat{y}_t = w^\top x_t$
  $r = x_t^\top \big(I + S_{t-1} S_{t-1}^\top + x_t x_t^\top\big)^{-1} x_t$
  $q = \big\| S_{t-1}^\top \big(I + S_{t-1} S_{t-1}^\top + x_t x_t^\top\big)^{-1} x_t \big\|$
  $s = \big\| \big(I + S_{t-1} S_{t-1}^\top + x_t x_t^\top\big)^{-1} x_t \big\|$
  if $\Big[\dfrac{\varepsilon - r}{s}\Big]_+ < q \sqrt{2 \ln \dfrac{t(t+1)}{2\delta}}$ then
    Query label $y_t$
    Update $w$ with Regularized Least Squares
    $S_t = [S_{t-1}, x_t]$
  end if
end for
The algorithm follows the general framework. Every time a query is not issued, the predicted output is far from the correct one by at most $\varepsilon$.
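To show the shape of such an algorithm in code, here is a much-simplified selective-sampling sketch built on regularized least squares. It is not the exact Parametric BBQ: the matrix is kept explicitly rather than through $S_t$, and the query test is a crude stand-in for the $[\cdot]_+$ condition above.

```python
import numpy as np

def selective_rls_sketch(stream, d, eps=0.1, delta=0.05):
    """RLS learner that queries a label only when the data seen so far do not
    guarantee a prediction bias below eps (simplified query rule)."""
    A = np.eye(d)               # I + sum of x x^T over queried instances
    b = np.zeros(d)
    t = n_queries = 0
    for x, get_label in stream:             # get_label() is called only on queries
        t += 1
        w = np.linalg.solve(A, b)           # regularized least squares weights
        y_hat = float(w @ x)                # prediction for this round
        A_aug = A + np.outer(x, x)
        r = float(x @ np.linalg.solve(A_aug, x))   # uncertainty of x
        threshold = np.sqrt(2.0 * np.log(t * (t + 1) / (2.0 * delta)))
        if max(eps - r, 0.0) < r * threshold:      # stand-in for the BBQ test
            y = get_label()
            n_queries += 1
            A, b = A_aug, b + y * x         # RLS update with the queried pair
    return np.linalg.solve(A, b), n_queries
```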

Regret Bound of Parametric BBQ
Theorem. If Parametric BBQ is run with input $\varepsilon, \delta \in (0, 1)$, then with probability at least $1 - \delta$:
$|\hat{y}_t - u^\top x_t| \leq \varepsilon$ holds on all time steps $t$ when no query is issued;
the number $N_T$ of queries issued after any number $T$ of steps is bounded as
$$N_T = O\left( \frac{d}{\varepsilon^2} \, \ln\frac{T}{\delta} \, \ln\frac{\ln(T/\delta)}{\varepsilon} \right).$$

Synthetic Experiment
[Figure: theoretical and measured maximum error, and fraction of queried labels, as a function of ε.]
We tested Parametric BBQ on 10,000 random examples on the unit circle in R². The labels were generated according to our noise model, using a randomly selected hyperplane u with unit norm.

Real World Experiments
[Figure: F-measure vs. fraction of queried labels for SOLE, Random, and Parametric BBQ on the Adult9 dataset (left, Gaussian kernel) and RCV1 (right, linear kernel).]

Summary
Many real world problems cannot be solved using the standard batch framework.
The online learning framework offers a useful tool in these cases.
A general algorithm that covers many of the previously known online learning algorithms has been presented.
The framework allows one to easily design algorithms for specific problems.

Thanks for your attention
Code: http://dogma.sourceforge.net
My website: http://francesco.orabona.com