Online Semi-Supervised Learning

Size: px

Start display at page:

Download "Online Semi-Supervised Learning"

Derick Gregory
8 years ago
Views:

1 Online Semi-Supervised Learning Andrew B. Goldberg, Ming Li, Xiaojin Zhu Computer Sciences University of Wisconsin Madison Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 1 / 18

edu Computer Sciences University of Wisconsin Madison

2 Life-long learning x 1 x 2... x x y 1 = y 1000 = 1... y = 0... Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 2 / 18

3 his is how children learn, too x 1 x 2... x x y 1 = y 1000 = 1... y = 0... Unlike standard supervised learning: n examples arrive sequentially, cannot even store them all most examples unlabeled no iid assumption, p(x, y) can change over time Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 3 / 18

4 New paradigm: online semi-supervised learning Main contribution: merging 1 online learning: learn non-iid sequentially, but fully labeled 2 semi-supervised learning: learn from labeled and unlabeled data, but in batch mode 1 At time t, adversary picks x t X, y t Y not necessarily iid, shows x t 2 Learner has classifier f t : X R, predicts f t (x t ) 3 With small probability, adversary reveals y t ; otherwise it abstains (unlabeled) 4 Learner updates to f t+1 based on x t and y t (if given). Repeat. Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 4 / 18

iid, shows x t 2 Learner has classifier f t : X R, predicts f t (x t ) 3 With small probability, adversary reveals y t ; otherwise it abstains

5 Review: batch manifold regularization A form of graph-based semi-supervised learning [Belkin et al. JMLR06]: Graph on x 1... x n Edge weights w st encode similarity between x s, x t, e.g., knn Assumption: similar examples have similar labels Manifold regularization minimizes risk: J(f) = 1 l t=1 δ(y t )c(f(x t ), y t ) + λ 1 2 f 2 K + λ 2 2 (f(x s ) f(x t )) 2 w st s,t=1 d1 c(f(x), y) convex loss function, e.g., the hinge loss. Solution f = arg min f J(f). Generalizes graph mincut and label propagation. d4 d2 d3 Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 5 / 18

weights w st encode similarity between x s, x t, e.g., knn Assumption: similar examples have similar labels Manifold regularization minimizes risk:

6 From batch to online batch risk = average instantaneous risks J(f) = 1 t=1 J t(f) Batch risk J(f) = 1 l t=1 δ(y t )c(f(x t ), y t ) + λ 1 2 f 2 K + λ 2 2 (f(x s ) f(x t )) 2 w st s,t=1 Instantaneous risk J t (f) = l δ(y t)c(f(x t ), y t ) + λ 1 2 f 2 K + λ 2 t (f(x i ) f(x t )) 2 w it (includes graph edges between x t and all previous examples) i=1 Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 6 / 18

(f) = l δ(y t)c(f(x t ), y t ) + λ 1 2 f 2 K + λ 2 t (f(x i ) f(x t )) 2 w it (includes graph edges

7 Online convex programming Instead of minimizing convex J(f), reduce convex J t (f) at each step t. J t (f) f t+1 = f t η t f Remarkable no regret guarantee against adversary: Accuracy can be arbitrarily bad if adversary flips target often If so, no batch learner in hindsight can do well either ft regret 1 J t (f t ) J(f ) t=1 [Zinkevich ICML03] No regret: lim sup 1 t=1 J t(f t ) J(f ) 0. If no adversary (iid), the average classifier f = 1/ t=1 f t is good: J( f) J(f ). Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 7 / 18

often If so, no batch learner in hindsight can do well either ft regret 1 J t (f t ) J(f ) t=1 [Zinkevich ICML03] No regret: lim sup 1

8 Kernelized algorithm Init: t = 1, f 1 = 0 Repeat t 1 f t ( ) = α (t) i K(x i, ) i=1 1 receive x t, predict f t (x t ) = t 1 2 occasionally receive y t 3 update f t to f t+1 by i=1 α(t) i K(x i, x t ) α (t+1) i = (1 η t λ 1 )α (t) i 2η t λ 2 (f t (x i ) f t (x t ))w it, i < t t α (t+1) t = 2η t λ 2 4 store x t, let t = t + 1 i=1 (f t (x i ) f t (x t ))w it η t l δ(y t)c (f(x t ), y t ) Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 8 / 18

2η t λ 2 (f t (x i ) f t (x t ))w it, i < t t α (t+1) t = 2η t λ 2 4 store x t, let t = t + 1 i=1 (f t (x i ) f t

9 Sparse approximation he algorithm is impractical space O( ): stores all previous examples time O( 2 ): each new example compared to all previous ones wo ways to speed up: buffering, or random projection tree Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 9 / 18

previous ones wo ways to speed up: buffering, or random projection

10 Sparse approximation 1: buffering Keep a size τ buffer approximate representers: f t = t 1 i=t τ α(t) i K(x i, ) approximate instantaneous risk J t (f) = l δ(y t)c(f(x t ), y t ) + λ 1 t 2 f 2 K + λ 2 τ dynamic graph on examples in the buffer t (f(x i ) f(x t )) 2 w it i=t τ Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 10 / 18

) + λ 1 t 2 f 2 K + λ 2 τ dynamic graph on examples in the buffer t (f(x i ) f(x t )) 2

11 Sparse approximation 1: buffer update At each step, start with the current τ representers: f t = t 1 i=t τ Gradient descent on τ + 1 terms: f = α (t) i K(x i, ) + 0K(x t, ) t i=t τ Reduce to τ representers f t+1 = t Kernel matching pursuit α ik(x i, ) i=t τ+1 α(t+1) i min f f t+1 2 α (t+1) K(x i, ) by Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 11 / 18

to τ representers f t+1 = t Kernel matching pursuit α ik(x i, ) i=t τ+1 α(t+1) i min f f t+1 2

12 Sparse approximation 2: random projection tree [Dasgupta and Freund, SOC08] Discretize data manifold by online clustering. When a cluster accumulates enough examples, split along random hyperplane. Extends k-d tree. Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 12 / 18

When a cluster accumulates enough examples, split along random hyperplane.

13 Sparse approximation 2: random projection tree We use the clusters N (µ i, Σ i ) as representers: f t = s i=1 β (t) i K(µ i, ) Cluster graph edge weight between a cluster µ i and example x t is [ w µi t = E x N (µi,σ i ) exp ( x x t 2 )] 2σ 2 = (2π) d 2 Σi 1 2 Σ Σ 2 ( exp A further approximation is ( 1 2 µ i Σ 1 i w µi t = e µ i x t 2 /2σ 2 Update f (i.e., β) and the RPtree, discard x t. µ i + x t Σ 1 0 x t µ Σ µ ) ) Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 13 / 18

Σi 1 2 Σ0 1 1 2 Σ 2 ( exp A further approximation is ( 1 2 µ i Σ 1 i w µi t = e µ i x t 2 /2σ 2 Update f (i.e., β) and the RPtree, discard x t.

14 Experiment: runtime Buffering and RPtree scales linearly, enabling life-long learning. Spirals MNIS 0 vs. 1 ime (seconds) Batch MR Online MR Online MR (buffer) Online RPtree Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 14 / 18

1 ime (seconds) 500 400 300 200 100 500 400 300 200 100 Batch MR Online MR Online

15 Experiment: risk Online MR risk J air ( ) 1 t=1 J t(f t ) approaches batch risk J(f ) as increases. Risk J(f*) Batch MR J air () Online MR J air () Online MR (buffer) J air () Online RPtree Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 15 / 18

8 J(f*) Batch MR J air () Online MR J air () Online MR (buffer) J air () Online RPtree 0.

16 Experiment: generalization error of f if iid A variation of buffering as good as batch MR (preferentially keep labeled examples, but not their labels, in buffer). Generalization error rate Batch MR Online MR Online MR (buffer) Online MR (buffer U) Online RPtree Online RPtree (PPK) Generalization error rate Batch MR Online MR Online MR (buffer) Online MR (buffer U) Online RPtree (a) Spirals (b) Face Generalization error rate Batch MR Online MR Online MR (buffer) Online MR (buffer U) Online RPtree Generalization error rate Batch MR Online MR Online MR (buffer) Online MR (buffer U) Online RPtree (c) MNIS 0 vs. 1 (d) MNIS 1 vs. 2 Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 16 / 18

05 0.05 0 0 1000 2000 3000 4000 5000 6000 7000 0 0 2000 4000 6000 8000 10000 (a) Spirals (b) Face Generalization error rate 0.4 0.35 0.3 0.25 0.2 0.15 0.

17 Experiment: adversarial concept drift Slowly rotating spirals, both p(x) and p(y x) changing. Batch f vs. online MR buffering f est set drawn from the current p(x, y) at time. 0.7 Batch MR Online MR (buffer) 0.6 Generalization error rate Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 17 / 18

7 Batch MR Online MR (buffer) 0.6 Generalization error rate 0.5 0.4 0.3 0.2 0.

18 Conclusions Online semi-supervised learning framework Sparse approximations: buffering and RPtree Future work: new bounds, new algorithms (e.g., S3VM) Xiaojin Zhu (Univ. Wisconsin-Madison) Online Semi-Supervised Learning 18 / 18

new bounds, new algorithms (e.g., S3VM) Xiaojin Zhu (Univ.

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery