Online Semi-Supervised Learning




Online Semi-Supervised Learning
Andrew B. Goldberg, Ming Li, Xiaojin Zhu (jerryzhu@cs.wisc.edu)
Computer Sciences, University of Wisconsin-Madison
Xiaojin Zhu (Univ. Wisconsin-Madison), Online Semi-Supervised Learning, slide 1 / 18

Life-long learning

x_1, x_2, ..., x_1000, ..., x_1000000, ...
y_1 = 0, -, ..., -, y_1000 = 1, ..., y_1000000 = 0, ...

This is how children learn, too: the same never-ending stream of mostly unlabeled examples. Unlike standard supervised learning:
- examples arrive sequentially; we cannot even store them all
- most examples are unlabeled
- no iid assumption; p(x, y) can change over time

New paradigm: online semi-supervised learning

Main contribution: merging
1. online learning: learns from sequential, non-iid data, but fully labeled
2. semi-supervised learning: learns from labeled and unlabeled data, but in batch mode

The protocol:
1. At time t, the adversary picks x_t ∈ X, y_t ∈ Y (not necessarily iid), and shows x_t
2. The learner has a classifier f_t : X → R, and predicts f_t(x_t)
3. With small probability, the adversary reveals y_t; otherwise it abstains (unlabeled)
4. The learner updates to f_{t+1} based on x_t and y_t (if given). Repeat.
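The four-step protocol can be sketched as a driver loop; the `predict`/`update` learner interface and the `label_prob` parameter are illustrative names, not from the slides.

```python
import random

def online_ssl_protocol(stream, learner, label_prob=0.05, seed=0):
    """Run the online semi-supervised protocol on a stream of (x, y) pairs.

    At each step the learner predicts on x_t; with small probability the
    true label y_t is revealed, otherwise the example stays unlabeled.
    `learner` needs .predict(x) and .update(x, y_or_None)."""
    rng = random.Random(seed)
    predictions = []
    for x_t, y_t in stream:
        predictions.append(learner.predict(x_t))               # step 2: predict f_t(x_t)
        revealed = y_t if rng.random() < label_prob else None  # step 3: label revealed rarely
        learner.update(x_t, revealed)                          # step 4: form f_{t+1}
    return predictions
```

Any learner exposing this interface (such as the kernelized algorithm later in the talk) can be plugged into the loop.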

Review: batch manifold regularization

A form of graph-based semi-supervised learning [Belkin et al., JMLR 2006]:
- a graph on x_1, ..., x_n
- edge weights w_st encode the similarity between x_s and x_t, e.g., kNN
- assumption: similar examples have similar labels

Manifold regularization minimizes the risk

J(f) = (1/l) Σ_{t=1}^n δ(y_t) c(f(x_t), y_t) + (λ1/2) ||f||²_K + (λ2/2) Σ_{s,t=1}^n (f(x_s) − f(x_t))² w_st

where c(f(x), y) is a convex loss function, e.g., the hinge loss, and δ(y_t) indicates whether y_t is observed. The solution is f* = argmin_f J(f). This generalizes graph mincut and label propagation.
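As a concrete instance of this risk, a minimal NumPy sketch with the hinge loss; the function name and the precomputed `f_norm_sq` argument are assumptions for illustration:

```python
import numpy as np

def batch_mr_risk(f_vals, y, labeled, W, lam1, lam2, f_norm_sq):
    """Batch manifold-regularization risk J(f): average hinge loss on the
    labeled points, RKHS norm penalty, and graph smoothness penalty.
    f_vals[t] = f(x_t); labeled is a boolean mask (the delta in the slides)."""
    l = labeled.sum()
    hinge = np.maximum(0.0, 1.0 - y[labeled] * f_vals[labeled])
    diff2 = (f_vals[:, None] - f_vals[None, :]) ** 2
    return (hinge.sum() / l
            + (lam1 / 2.0) * f_norm_sq
            + (lam2 / 2.0) * np.sum(W * diff2))
```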

From batch to online

Batch risk = average of the instantaneous risks: J(f) = (1/T) Σ_{t=1}^T J_t(f)

Batch risk:
J(f) = (1/l) Σ_{t=1}^T δ(y_t) c(f(x_t), y_t) + (λ1/2) ||f||²_K + (λ2/2) Σ_{s,t=1}^T (f(x_s) − f(x_t))² w_st

Instantaneous risk:
J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ||f||²_K + λ2 T Σ_{i=1}^{t−1} (f(x_i) − f(x_t))² w_it
(includes the graph edges between x_t and all previous examples)
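A numeric sanity check of this decomposition, with squared loss standing in for the generic convex loss c and a random symmetric w with zero diagonal (all names are illustrative):

```python
import numpy as np

def batch_and_average_inst(T=6, seed=0):
    """Check J(f) = (1/T) sum_t J_t(f) for a fixed f evaluated on T points.
    Squared loss (f - y)^2 stands in for c; W is symmetric with zero diagonal,
    so (lam2/2) * sum_{s,t} equals lam2 * sum_{i < t}."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(T)                 # f(x_1), ..., f(x_T)
    y = rng.choice([-1.0, 1.0], size=T)
    labeled = rng.random(T) < 0.5
    labeled[0] = True                          # ensure l >= 1
    l = labeled.sum()
    W = rng.random((T, T))
    W = (W + W.T) / 2.0
    np.fill_diagonal(W, 0.0)
    lam1, lam2, f_norm_sq = 0.3, 0.7, 1.0

    c = (f - y) ** 2
    batch = (c[labeled].sum() / l + lam1 / 2 * f_norm_sq
             + lam2 / 2 * np.sum(W * (f[:, None] - f[None, :]) ** 2))
    inst = [(T / l) * (c[t] if labeled[t] else 0.0) + lam1 / 2 * f_norm_sq
            + lam2 * T * sum(W[i, t] * (f[i] - f[t]) ** 2 for i in range(t))
            for t in range(T)]
    return batch, float(np.mean(inst))
```

The two returned values agree to machine precision, confirming that each graph edge (i, t), i < t, is charged exactly once across the instantaneous risks.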

Online convex programming

Instead of minimizing the convex J(f), reduce the convex J_t(f) at each step t [Zinkevich, ICML 2003]:
f_{t+1} = f_t − η_t ∂J_t(f)/∂f |_{f_t}

Remarkable no-regret guarantee against the adversary:
regret = (1/T) Σ_{t=1}^T J_t(f_t) − J(f*)
No regret: limsup_{T→∞} (1/T) Σ_{t=1}^T J_t(f_t) − J(f*) ≤ 0.

Note: accuracy can be arbitrarily bad if the adversary flips the target often; if so, no batch learner in hindsight can do well either.

If there is no adversary (iid data), the averaged classifier f̄ = (1/T) Σ_{t=1}^T f_t is good: J(f̄) → J(f*).
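A generic sketch of this descent step; the decaying step size η_t = η_0/√t is an assumption (the slides do not specify a schedule):

```python
import numpy as np

def ogd(grad, f0, T, eta0=0.1):
    """Online gradient descent: f_{t+1} = f_t - eta_t * grad(t, f_t),
    with step size eta_t = eta0 / sqrt(t). Returns the iterates f_1..f_T."""
    f = float(f0)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(f)
        f = f - (eta0 / np.sqrt(t)) * grad(t, f)
    return iterates
```

On an iid-like sequence, the average of the iterates approaches the best fixed predictor, illustrating the J(f̄) → J(f*) claim.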

Kernelized algorithm

Init: t = 1, f_1 = 0, where f_t(·) = Σ_{i=1}^{t−1} α_i^(t) K(x_i, ·). Repeat:
1. Receive x_t, predict f_t(x_t) = Σ_{i=1}^{t−1} α_i^(t) K(x_i, x_t)
2. Occasionally receive y_t
3. Update f_t to f_{t+1} by
   α_i^(t+1) = (1 − η_t λ1) α_i^(t) − 2 η_t λ2 T (f_t(x_i) − f_t(x_t)) w_it,  i < t
   α_t^(t+1) = 2 η_t λ2 T Σ_{i=1}^{t−1} (f_t(x_i) − f_t(x_t)) w_it − η_t (T/l) δ(y_t) c′(f_t(x_t), y_t)
4. Store x_t, let t = t + 1
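Putting the update together, a minimal sketch assuming an RBF kernel doubling as the edge weight w_it, the hinge loss, a known horizon T and label count l, and η_t = η_0/√t (all assumptions for illustration):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

class OnlineMR:
    """Sketch of the kernelized online manifold-regularization update."""
    def __init__(self, lam1, lam2, T, l, eta0=0.1, sigma=1.0):
        self.lam1, self.lam2, self.T, self.l = lam1, lam2, T, l
        self.eta0, self.sigma = eta0, sigma
        self.xs, self.alpha = [], []

    def predict(self, x):
        return sum(a * rbf(xi, x, self.sigma) for a, xi in zip(self.alpha, self.xs))

    def update(self, x_t, y_t=None):
        x_t = np.asarray(x_t, dtype=float)
        t = len(self.xs) + 1
        eta = self.eta0 / np.sqrt(t)
        f_xt = self.predict(x_t)
        w = [rbf(xi, x_t, self.sigma) for xi in self.xs]   # edge weights w_it
        d = [self.predict(xi) - f_xt for xi in self.xs]    # f_t(x_i) - f_t(x_t)
        # shrink old coefficients and apply the graph-smoothness pull
        self.alpha = [(1 - eta * self.lam1) * a
                      - 2 * eta * self.lam2 * self.T * di * wi
                      for a, di, wi in zip(self.alpha, d, w)]
        # coefficient of the new representer K(x_t, .)
        a_t = 2 * eta * self.lam2 * self.T * sum(di * wi for di, wi in zip(d, w))
        if y_t is not None and y_t * f_xt < 1:             # hinge subgradient c' = -y_t
            a_t += eta * (self.T / self.l) * y_t
        self.alpha.append(a_t)
        self.xs.append(x_t)
```

Note the O(t) work per step: this is exactly the cost that the sparse approximations on the next slides remove.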

Sparse approximation

The algorithm above is impractical:
- space O(T): it stores all previous examples
- time O(T²): each new example is compared to all previous ones

Two ways to speed it up: buffering, or a random projection tree.

Sparse approximation 1: buffering

Keep a size-τ buffer of approximate representers:
f_t = Σ_{i=t−τ}^{t−1} α_i^(t) K(x_i, ·)

Approximate instantaneous risk (a dynamic graph on the examples in the buffer):
J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ||f||²_K + λ2 (T t/τ) Σ_{i=t−τ}^{t−1} (f(x_i) − f(x_t))² w_it

Sparse approximation 1: buffer update

At each step, start with the current τ representers:
f_t = Σ_{i=t−τ}^{t−1} α_i^(t) K(x_i, ·) + 0 · K(x_t, ·)

Gradient descent on the τ + 1 terms:
f′ = Σ_{i=t−τ}^{t} α′_i K(x_i, ·)

Reduce back to τ representers,
f_{t+1} = Σ_{i=t−τ+1}^{t} α_i^(t+1) K(x_i, ·),
by kernel matching pursuit: min ||f′ − f_{t+1}||².
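The reduction step can be sketched as a direct least-squares projection in the RKHS norm (a closed-form solve standing in for the greedy kernel matching pursuit; names are illustrative):

```python
import numpy as np

def reduce_representers(X, alpha, keep, kernel):
    """Project f' = sum_j alpha_j K(x_j, .) onto span{K(x_i, .) : i in keep},
    minimizing the RKHS norm of the residual: solve K_kk beta = K_ka alpha."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    K_kk = K[np.ix_(keep, keep)]
    K_ka = K[keep, :]
    # tiny jitter keeps the solve stable if representers nearly coincide
    return np.linalg.solve(K_kk + 1e-10 * np.eye(len(keep)), K_ka @ alpha)
```

Matching pursuit approximates this projection greedily, which is cheaper when τ is large.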

Sparse approximation 2: random projection tree

[Dasgupta and Freund, STOC 2008] Discretize the data manifold by online clustering. When a cluster accumulates enough examples, split it along a random hyperplane. Extends the k-d tree.
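A single split step might look like the following sketch; splitting exactly at the median is a simplification of the randomized split rule in the paper:

```python
import numpy as np

def rp_split(points, rng):
    """One random-projection-tree split: project the points onto a random
    unit direction and split at the median projection value."""
    points = np.asarray(points, dtype=float)
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    proj = points @ direction
    thresh = np.median(proj)
    return points[proj <= thresh], points[proj > thresh]
```

Unlike a k-d tree, the split direction is random rather than axis-aligned, which is what lets the tree adapt to low-dimensional manifold structure.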

Sparse approximation 2: random projection tree

We use the clusters N(μ_i, Σ_i) as representers:
f_t = Σ_{i=1}^{s} β_i^(t) K(μ_i, ·)

The cluster-graph edge weight between a cluster μ_i and an example x_t is
w_{μ_i t} = E_{x ∼ N(μ_i, Σ_i)} [ exp(−||x − x_t||² / (2σ²)) ]
          = (|Σ̄| / |Σ_i|)^{1/2} exp(−(1/2)(μ_iᵀ Σ_i^{−1} μ_i + x_tᵀ Σ_0^{−1} x_t − μ̄ᵀ Σ̄^{−1} μ̄))
where Σ_0 = σ²I, Σ̄ = (Σ_i^{−1} + Σ_0^{−1})^{−1}, and μ̄ = Σ̄ (Σ_i^{−1} μ_i + Σ_0^{−1} x_t).

A further approximation is w_{μ_i t} = e^{−||μ_i − x_t||² / (2σ²)}.

Update f (i.e., β) and the RPtree, then discard x_t.
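The closed form can be checked numerically; the sketch below uses the equivalent formulation with A = Σ_i + σ²I (obtained from the product-of-Gaussians identity) and compares it to a Monte Carlo estimate:

```python
import numpy as np

def expected_rbf_weight(mu, Sigma, x_t, sigma):
    """Closed form for E_{x ~ N(mu, Sigma)} exp(-||x - x_t||^2 / (2 sigma^2)),
    written with A = Sigma + sigma^2 I:
    w = sqrt(det(sigma^2 I) / det(A)) * exp(-(mu - x_t)' A^{-1} (mu - x_t) / 2)."""
    d = len(mu)
    A = Sigma + sigma ** 2 * np.eye(d)
    diff = mu - x_t
    quad = diff @ np.linalg.solve(A, diff)
    det_ratio = (sigma ** 2) ** d / np.linalg.det(A)
    return np.sqrt(det_ratio) * np.exp(-0.5 * quad)
```

As Σ_i → 0 the cluster collapses to a point and the weight reduces to the further approximation e^{−||μ_i − x_t||²/(2σ²)}.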

Experiment: runtime

Buffering and RPtree scale linearly, enabling life-long learning.

[Figure: runtime (seconds) vs. number of examples on Spirals and MNIST 0 vs. 1, comparing Batch MR, Online MR, Online MR (buffer), and Online RPtree.]

Experiment: risk

The online MR risk J_air(T) = (1/T) Σ_{t=1}^T J_t(f_t) approaches the batch risk J(f*) as T increases.

[Figure: risk vs. T for J(f*) (Batch MR) and J_air(T) for Online MR, Online MR (buffer), and Online RPtree.]

Experiment: generalization error of f̄ when the data is iid

A variation of buffering is as good as batch MR (preferentially keep labeled examples, but not their labels, in the buffer).

[Figure: generalization error rate vs. number of examples for Batch MR, Online MR, Online MR (buffer), Online MR (buffer U), Online RPtree, and Online RPtree (PPK) on (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2.]

Experiment: adversarial concept drift

Slowly rotating spirals, with both p(x) and p(y|x) changing. Batch f* vs. online MR with buffering; the test set is drawn from the current p(x, y) at time T.

[Figure: generalization error rate vs. T for Batch MR and Online MR (buffer).]

Conclusions

- An online semi-supervised learning framework
- Sparse approximations: buffering and RPtree
- Future work: new bounds, new algorithms (e.g., S3VM)