Online Semi-Supervised Learning
Andrew B. Goldberg, Ming Li, Xiaojin Zhu
jerryzhu@cs.wisc.edu
Department of Computer Sciences, University of Wisconsin-Madison
Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 1 / 18
Life-long learning
[Figure: a never-ending data stream x_1, x_2, ..., x_1000, ..., x_1000000, ... in which only a few labels appear: y_1 = 0, ..., y_1000 = 1, ..., y_1000000 = 0, ...]
This is how children learn, too
[Figure: the same stream x_1, x_2, ..., x_1000000, ... with occasional labels y_1 = 0, y_1000 = 1, y_1000000 = 0]
Unlike standard supervised learning:
- examples arrive sequentially; we cannot even store them all
- most examples are unlabeled
- no iid assumption; p(x, y) can change over time
New paradigm: online semi-supervised learning
Main contribution: merging
1. online learning: learns from non-iid sequential data, but fully labeled
2. semi-supervised learning: learns from labeled and unlabeled data, but in batch mode
The protocol:
1. At time t, the adversary picks x_t ∈ X, y_t ∈ Y (not necessarily iid) and shows x_t
2. The learner has a classifier f_t : X → R and predicts f_t(x_t)
3. With small probability, the adversary reveals y_t; otherwise it abstains (x_t stays unlabeled)
4. The learner updates to f_{t+1} based on x_t and y_t (if given). Repeat.
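A minimal sketch of this protocol, assuming a toy linear learner, a deterministic adversary, and a 10% label rate; all of these choices are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def adversary(t):
    """Adversary picks (x_t, y_t); no iid assumption is required."""
    x = rng.normal(size=2) + (0.0 if t % 2 == 0 else 3.0)
    y = 0 if t % 2 == 0 else 1
    return x, y

def run(T=100, label_prob=0.1):
    w = np.zeros(2)                    # toy linear f_t(x) = w . x
    n_labeled = 0
    for t in range(T):
        x, y = adversary(t)            # step 1: adversary shows x_t
        pred = float(w @ x)            # step 2: learner predicts f_t(x_t)
        if rng.random() < label_prob:  # step 3: y_t occasionally revealed
            n_labeled += 1
            # step 4: supervised update only when y_t is shown (hinge-style)
            if (2 * y - 1) * pred < 1:
                w += 0.1 * (2 * y - 1) * x
        # (a semi-supervised learner would also use the unlabeled x_t here)
    return w, n_labeled

w, n_labeled = run()
```

Most rounds pass with no label at all, which is exactly the regime the rest of the talk addresses.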
Review: batch manifold regularization
A form of graph-based semi-supervised learning [Belkin et al. JMLR06]:
- Graph on x_1, ..., x_n
- Edge weights w_st encode the similarity between x_s and x_t, e.g., kNN
- Assumption: similar examples have similar labels
Manifold regularization minimizes the risk
J(f) = (1/l) Σ_{t=1}^n δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + (λ2/2) Σ_{s,t=1}^n (f(x_s) − f(x_t))² w_st
where c(f(x), y) is a convex loss function, e.g., the hinge loss, and δ(y_t) indicates whether x_t is labeled. The solution is f* = argmin_f J(f). Generalizes graph mincut and label propagation.
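The three terms of J(f) can be spelled out in code. This is an illustrative sketch only: the Gaussian edge weights, the toy data, and the supplied ‖f‖²_K value (which would really come from the RKHS) are all stand-ins:

```python
import numpy as np

def batch_mr_risk(f_vals, X, y, lam1, lam2, f_norm_sq, sigma=1.0):
    """J(f) = (1/l) sum_t delta(y_t) c(f(x_t), y_t)
              + lam1/2 * ||f||_K^2
              + lam2/2 * sum_{s,t} (f(x_s) - f(x_t))^2 w_st
    y[t] is None for unlabeled examples, i.e. delta(y_t) = 0."""
    labeled = [t for t, yt in enumerate(y) if yt is not None]
    l = len(labeled)
    # labeled hinge-loss term, averaged over the l labeled examples
    hinge = sum(max(0.0, 1.0 - y[t] * f_vals[t]) for t in labeled) / l
    # RKHS norm term; f_norm_sq is a stand-in value here
    reg = lam1 / 2.0 * f_norm_sq
    # graph smoothness term with Gaussian edge weights (a kNN graph is
    # another common choice)
    n = len(X)
    graph = 0.0
    for s in range(n):
        for t in range(n):
            w_st = np.exp(-np.sum((X[s] - X[t]) ** 2) / (2 * sigma ** 2))
            graph += (f_vals[s] - f_vals[t]) ** 2 * w_st
    return hinge + reg + lam2 / 2.0 * graph

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
y = [-1, None, 1]                      # middle example is unlabeled
f_vals = np.array([-0.8, -0.7, 0.9])   # hypothetical function values
J = batch_mr_risk(f_vals, X, y, lam1=0.1, lam2=0.1, f_norm_sq=1.0)
```

Note how the unlabeled point still contributes through the graph term: it is pulled toward the value of its near neighbor.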
From batch to online
Batch risk = average of instantaneous risks: J(f) = (1/T) Σ_{t=1}^T J_t(f)
Batch risk:
J(f) = (1/l) Σ_{t=1}^T δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + (λ2/2) Σ_{s,t=1}^T (f(x_s) − f(x_t))² w_st
Instantaneous risk:
J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + λ2 T Σ_{i=1}^{t−1} (f(x_i) − f(x_t))² w_it
(includes the graph edges between x_t and all previous examples)
Online convex programming
Instead of minimizing the convex J(f), reduce the convex J_t(f) at each step t [Zinkevich ICML03]:
f_{t+1} = f_t − η_t ∂J_t(f)/∂f |_{f_t}
Remarkable no-regret guarantee against the adversary:
- regret: (1/T) Σ_{t=1}^T J_t(f_t) − J(f*)
- no regret: limsup_{T→∞} (1/T) Σ_{t=1}^T J_t(f_t) − J(f*) ≤ 0
- Accuracy can still be arbitrarily bad if the adversary flips the target often; if so, no batch learner in hindsight can do well either
- If there is no adversary (iid data), the average classifier f̄ = (1/T) Σ_{t=1}^T f_t is good: J(f̄) → J(f*)
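A minimal online gradient descent sketch on squared-loss instantaneous risks J_t(w) = (w − z_t)²; the 1/√t step size is a standard no-regret choice, and the stationary target stream stands in for the iid case:

```python
import numpy as np

def ogd(targets, eta0=0.5):
    """Online gradient descent on J_t(w) = (w - z_t)^2."""
    w = 0.0
    iterates, inst_risks = [], []
    for t, z in enumerate(targets, start=1):
        inst_risks.append((w - z) ** 2)   # J_t(w_t), incurred before update
        grad = 2.0 * (w - z)              # dJ_t/dw evaluated at w_t
        w -= eta0 / np.sqrt(t) * grad     # w_{t+1} = w_t - eta_t * grad
        iterates.append(w)
    return np.array(iterates), np.array(inst_risks)

targets = np.full(200, 1.0)               # stationary (iid-like) sequence
iterates, inst_risks = ogd(targets)
w_bar = iterates.mean()                   # the average classifier f-bar
```

Here the best fixed choice in hindsight is w* = 1, and both the average instantaneous risk and the average iterate approach it, mirroring the no-regret statement on the slide.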
Kernelized algorithm
Init: t = 1, f_1 = 0, with f_t(·) = Σ_{i=1}^{t−1} α_i^{(t)} K(x_i, ·)
Repeat:
1. receive x_t, predict f_t(x_t) = Σ_{i=1}^{t−1} α_i^{(t)} K(x_i, x_t)
2. occasionally receive y_t
3. update f_t to f_{t+1}:
α_i^{(t+1)} = (1 − η_t λ1) α_i^{(t)} − 2 η_t λ2 (f_t(x_i) − f_t(x_t)) w_it, for i < t
α_t^{(t+1)} = 2 η_t λ2 Σ_{i=1}^{t−1} (f_t(x_i) − f_t(x_t)) w_it − η_t (T/l) δ(y_t) c′(f_t(x_t), y_t)
4. store x_t, let t = t + 1
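A sketch of one such update, assuming a Gaussian kernel, hinge loss with subgradient c′(f, y) = −y when y·f < 1, Gaussian edge weights w_it, and illustrative values for η_t, λ1, λ2, and T/l:

```python
import numpy as np

def K(a, b, sigma=1.0):
    """Gaussian kernel (an illustrative choice)."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def step(X, alpha, x_t, y_t, eta, lam1, lam2, T_over_l=1.0):
    """One update f_t -> f_{t+1}; y_t is None when the label is withheld."""
    f = lambda x: sum(a * K(xi, x) for a, xi in zip(alpha, X))
    f_xt = f(x_t)
    w_it = np.array([K(xi, x_t) for xi in X])     # graph edge weights
    diffs = np.array([f(xi) - f_xt for xi in X])  # f_t(x_i) - f_t(x_t)
    # shrink old coefficients and apply the graph-smoothness correction
    new_alpha = (1 - eta * lam1) * np.asarray(alpha) \
        - 2 * eta * lam2 * diffs * w_it
    # coefficient on the new representer K(x_t, .)
    a_t = 2 * eta * lam2 * np.sum(diffs * w_it)
    if y_t is not None and y_t * f_xt < 1:        # hinge subgradient c'
        a_t -= eta * T_over_l * (-y_t)
    return X + [x_t], np.append(new_alpha, a_t)

X, alpha = [], np.array([])
rng = np.random.default_rng(1)
for t in range(20):
    x_t = rng.normal(size=2)
    y_t = 1 if t % 5 == 0 else None               # occasional labels
    X, alpha = step(X, alpha, x_t, y_t, eta=0.1, lam1=0.1, lam2=0.1)
```

Note the impracticality flagged on the next slide is visible here: the representer lists X and alpha grow by one every step.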
Sparse approximation
The algorithm above is impractical:
- space O(T): stores all previous examples
- time O(T²): each new example is compared to all previous ones
Two ways to speed it up: buffering, or a random projection tree.
Sparse approximation 1: buffering
Keep a size-τ buffer of approximate representers:
f_t = Σ_{i=t−τ}^{t−1} α_i^{(t)} K(x_i, ·)
Approximate instantaneous risk:
J_t(f) = (T/l) δ(y_t) c(f(x_t), y_t) + (λ1/2) ‖f‖²_K + λ2 T (t/τ) Σ_{i=t−τ}^{t−1} (f(x_i) − f(x_t))² w_it
Dynamic graph on the examples in the buffer.
Sparse approximation 1: buffer update
At each step, start with the current τ representers plus the new example (with zero coefficient):
f′ = Σ_{i=t−τ}^{t−1} α_i^{(t)} K(x_i, ·) + 0 · K(x_t, ·)
Gradient descent on the τ + 1 terms.
Then reduce back to τ representers
f_{t+1} = Σ_{i=t−τ+1}^{t} α_i^{(t+1)} K(x_i, ·)
by kernel matching pursuit: min ‖f′ − f_{t+1}‖².
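One way to realize the reduction step is to evict the oldest representer and refit the remaining coefficients by solving the normal equations of min ‖f′ − f_{t+1}‖²_K directly; kernel matching pursuit greedily approximates this projection, so the direct least-squares solve below is a stand-in. Toy data, Gaussian kernel; the `jitter` regularizer is an assumption for numerical stability:

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian kernel Gram matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def reduce_buffer(X_old, alpha_old, jitter=1e-8):
    """Project f' = sum_i alpha_old[i] K(X_old[i], .) onto the span of
    X_old[1:], i.e. evict the oldest representer X_old[0]."""
    X_new = X_old[1:]
    K_nn = gram(X_new, X_new)
    K_no = gram(X_new, X_old)
    # normal equations of min_a || sum_j a_j K(x_j, .) - f' ||_K^2
    alpha_new = np.linalg.solve(K_nn + jitter * np.eye(len(X_new)),
                                K_no @ alpha_old)
    return X_new, alpha_new

rng = np.random.default_rng(0)
X_old = rng.normal(size=(6, 2))          # tau + 1 = 6 representers
alpha_old = rng.normal(size=6)
X_new, alpha_new = reduce_buffer(X_old, alpha_old)
```

The buffer size τ thus caps both memory and per-step work, at the cost of an approximation error controlled by the projection residual.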
Sparse approximation 2: random projection tree
[Dasgupta and Freund, STOC08]: discretize the data manifold by online clustering. When a cluster accumulates enough examples, split it along a random hyperplane. Extends the k-d tree.
Sparse approximation 2: random projection tree
We use the clusters N(µ_i, Σ_i) as representers: f_t = Σ_{i=1}^s β_i^{(t)} K(µ_i, ·)
The cluster-graph edge weight between a cluster µ_i and an example x_t is the expected weight under the cluster's Gaussian:
w_{µ_i t} = E_{x∼N(µ_i,Σ_i)} [ exp( −‖x − x_t‖² / (2σ²) ) ]
= (2π)^{−d/2} |Σ_i|^{−1/2} |Σ_0|^{−1/2} |Σ̄|^{1/2} exp( −(1/2)( µ_iᵀ Σ_i^{−1} µ_i + x_tᵀ Σ_0^{−1} x_t − µ̄ᵀ Σ̄^{−1} µ̄ ) )
with Σ_0 = σ²I, Σ̄ = (Σ_i^{−1} + Σ_0^{−1})^{−1}, µ̄ = Σ̄(Σ_i^{−1} µ_i + Σ_0^{−1} x_t).
A further approximation is w_{µ_i t} = e^{−‖µ_i − x_t‖²/(2σ²)}
Update f (i.e., β) and the RPtree, then discard x_t.
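A quick Monte Carlo sanity check of how close the cheap further approximation is to the expected edge weight, on toy values (the cluster covariance, query point, and bandwidth are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = 0.05 * np.eye(2)                  # a tight cluster
x_t = np.array([1.0, 0.5])
sigma = 1.0                               # kernel bandwidth

# exact expected weight E_{x ~ N(mu, Sigma)}[exp(-||x - x_t||^2 / 2 sigma^2)],
# estimated by Monte Carlo
samples = rng.multivariate_normal(mu, Sigma, size=20000)
exact = np.mean(np.exp(-((samples - x_t) ** 2).sum(1) / (2 * sigma ** 2)))

# the cheap approximation: plug in the cluster mean
approx = np.exp(-((mu - x_t) ** 2).sum() / (2 * sigma ** 2))
```

For tight clusters (Σ_i small relative to σ²) the two agree closely, which is what makes the plug-in approximation attractive.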
Experiment: runtime
Buffering and RPtree scale linearly in T, enabling life-long learning.
[Figure: runtime (seconds) vs. T on Spirals and MNIST 0 vs. 1, comparing Batch MR, Online MR, Online MR (buffer), and Online RPtree]
Experiment: risk
The online MR average instantaneous risk J_air(T) = (1/T) Σ_{t=1}^T J_t(f_t) approaches the batch risk J(f*) as T increases.
[Figure: risk vs. T, comparing J(f*) for Batch MR with J_air(T) for Online MR, Online MR (buffer), and Online RPtree]
Experiment: generalization error of f̄ in the iid setting
A variation of buffering is as good as batch MR (preferentially keep labeled examples, but not their labels, in the buffer).
[Figure: generalization error rate vs. T for Batch MR, Online MR, Online MR (buffer), Online MR (buffer U), Online RPtree, and Online RPtree (PPK), on four datasets: (a) Spirals, (b) Face, (c) MNIST 0 vs. 1, (d) MNIST 1 vs. 2]
Experiment: adversarial concept drift
Slowly rotating spirals, where both p(x) and p(y|x) change over time. Batch f* vs. online MR with buffering. The test set is drawn from the current p(x, y) at each time.
[Figure: generalization error rate vs. T for Batch MR and Online MR (buffer)]
Conclusions
- Online semi-supervised learning framework
- Sparse approximations: buffering and RPtree
- Future work: new bounds, new algorithms (e.g., S3VM)