Semi-Supervised Support Vector Machines and Application to Spam Filtering

1 Semi-Supervised Support Vector Machines and Application to Spam Filtering
Alexander Zien
Empirical Inference Department (Bernhard Schölkopf), Max Planck Institute for Biological Cybernetics
ECML 2006 Discovery Challenge

2 Outline
1 Introduction
2 Training a S³VM: Why It Matters; Some S³VM Training Methods; Gradient-based Optimization; The Continuation S³VM
3 Overview of SSL: Assumptions of SSL; A Crude Overview of SSL; Combining Methods
4 Application to Spam Filtering: Naive Application; Proper Model Selection
5 Conclusions

3 find a linear classification boundary

5 not robust wrt input noise!

6 SVM: maximum margin classifier
$$\min_{w,b} \; \underbrace{\tfrac{1}{2} w^\top w}_{\text{regularizer}} \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1$$

7 S³VM (TSVM): semi-supervised (transductive) SVM
$$\min_{w,b,(y_j)} \; \underbrace{\tfrac{1}{2} w^\top w}_{\text{regularizer}} \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \quad y_j (w^\top x_j + b) \ge 1$$

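8 soft margin S³VM
$$\min_{w,b,(y_j),(\xi_k)} \; \tfrac{1}{2} w^\top w + C \sum_i \xi_i + C^* \sum_j \xi_j$$
$$\text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0; \qquad y_j (w^\top x_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0$$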

9 Two Moons toy data
easy for human (0% error), hard for S³VMs!
S³VM optimization method        test error   objective value
Branch & Bound (global min.)    0.0%         7.81
CCCP (local min.)               64.0%
S³VMlight (local min.)          66.2%
∇S³VM (local min.)              59.3%
cS³VM (local min.)              45.7%
The objective function is good for SSL, so try to find better local minima!

10 Mixed Integer Programming [Bennett, Demiriz; NIPS 1998]
solves the soft margin S³VM problem (slide 8) directly
global optimum found by standard optimization packages (e.g. CPLEX)
combinatorial and NP-hard!
only works for small problems

11 S³VMlight [T. Joachims; ICML 1999]
train SVM on labeled points, predict the $y_j$
in prediction, always make sure that
$$\frac{\#\{y_j = +1\}}{\#\text{ unlabeled points}} = \frac{\#\{y_i = +1\}}{\#\text{ labeled points}} \qquad (*)$$
with stepwise increasing $C^*$ do
1. train SVM on all points, using labels $(y_i)$, $(y_j)$
2. predict new $y_j$ s.t. balancing constraint (*)
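The prediction step under the balancing constraint can be sketched as follows: label as +1 exactly as many unlabeled points as the labeled class ratio allows, choosing those with the largest outputs. This is a minimal sketch assuming NumPy; the function name `balanced_labels`, the rounding rule, and the toy numbers are illustrative assumptions, not taken from the S³VMlight implementation.

```python
import numpy as np

def balanced_labels(outputs, y_labeled):
    """Assign +1/-1 to unlabeled points so the positive fraction matches the labeled set."""
    outputs = np.asarray(outputs, dtype=float)
    y_labeled = np.asarray(y_labeled)
    frac_pos = np.mean(y_labeled == 1)             # #{y_i = +1} / # labeled points
    n_pos = int(round(frac_pos * len(outputs)))    # target value of #{y_j = +1}
    y_unlabeled = -np.ones(len(outputs), dtype=int)
    if n_pos > 0:
        top = np.argsort(-outputs)[:n_pos]         # indices of the n_pos largest outputs
        y_unlabeled[top] = 1
    return y_unlabeled

# Hypothetical outputs w.x_j + b on six unlabeled points; one third of the labels are positive.
outputs = np.array([2.0, 1.5, 0.3, 0.1, -0.2, -1.0])
y_labeled = np.array([1, -1, -1])
y_j = balanced_labels(outputs, y_labeled)
```

Note that the third point has a positive output but still receives the label -1, because the constraint caps the number of positive predictions.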

12 Balancing constraint required to avoid degenerate solutions!
(without it, the soft margin S³VM problem of slide 8 can be solved by assigning all unlabeled points to a single class)

13 Effective Loss Functions
At the optimum of the soft margin S³VM problem (slide 8), the slack variables take the values
$$\xi_i = \max\{1 - y_i (w^\top x_i + b),\; 0\}$$
$$\xi_j = \min_{y_j \in \{+1,-1\}} \max\{1 - y_j (w^\top x_j + b),\; 0\} = \max\{1 - |w^\top x_j + b|,\; 0\}$$
[Plots of the loss functions: $\xi_i$ against $y_i (w^\top x_i + b)$; $\xi_j$ against $w^\top x_j + b$]
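A short numerical check of the unlabeled loss may help here: minimizing the hinge loss over both possible labels gives a loss that depends only on the absolute value of the output. The function names below are illustrative assumptions.

```python
import numpy as np

def hinge(t):
    """Labeled loss max(1 - t, 0), with t = y_i (w.x_i + b)."""
    return np.maximum(1.0 - t, 0.0)

def unlabeled_loss(f):
    """Unlabeled loss: the smaller hinge loss over the two labelings of f = w.x_j + b."""
    return np.minimum(hinge(+f), hinge(-f))

f = np.linspace(-2.0, 2.0, 9)                      # a grid of outputs
closed_form = np.maximum(1.0 - np.abs(f), 0.0)     # max(1 - |f|, 0)
losses = unlabeled_loss(f)
```

The loss peaks at an output of zero, so unlabeled points are penalized for lying inside the margin on either side of the boundary.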

14 Resolving the Constraints
Substituting the effective losses into the soft margin S³VM problem gives the unconstrained objective
$$\tfrac{1}{2} w^\top w + C \sum_i \ell_l\big(y_i (w^\top x_i + b)\big) + C^* \sum_j \ell_u\big(w^\top x_j + b\big)$$
[Plots of the loss functions $\ell_l$ and $\ell_u$]
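The unconstrained objective can be evaluated directly for a linear classifier. The sketch below assumes the hinge loss for labeled points and the symmetric hinge max(1 - |t|, 0) for unlabeled points; the function name `s3vm_objective`, the parameter name `C_star` (standing for the weight on the unlabeled term), and the toy data are assumptions for illustration.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=1.0):
    """Regularizer plus labeled hinge losses plus unlabeled symmetric hinge losses."""
    out_lab = y_lab * (X_lab @ w + b)                         # y_i (w.x_i + b)
    out_unl = X_unl @ w + b                                   # w.x_j + b
    loss_lab = np.maximum(1.0 - out_lab, 0.0).sum()
    loss_unl = np.maximum(1.0 - np.abs(out_unl), 0.0).sum()
    return 0.5 * w @ w + C * loss_lab + C_star * loss_unl

X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[1.5, 1.0], [-1.5, -1.0], [0.0, 0.5]])
w = np.array([1.0, 0.0])
value = s3vm_objective(w, 0.0, X_lab, y_lab, X_unl)
```

Here only the third unlabeled point, which lies exactly on the boundary, contributes a loss; the other terms are the regularizer 0.5 and zeros.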

15 CCCP-S³VM [R. Collobert et al.; ICML 2006]
CCCP: Concave Convex Procedure
objective = convex function + concave function
starting from SVM solution, iterate:
1. approximate concave part by linear function at given point
2. solve resulting convex problem
[Fung, Mangasarian; 1999]: similar approach, restricted to linear S³VMs
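The two CCCP steps can be illustrated on a one-dimensional toy function rather than on the S³VM objective itself. The function f(w) = (w^4 + w) + (-4 w^2), its split into a convex and a concave part, and the starting point are all assumptions chosen so that the convex subproblem has a closed-form solution.

```python
import numpy as np

def f(w):
    # Convex part (w^4 + w) plus concave part (-4 w^2).
    return w**4 + w - 4.0 * w**2

def cccp(w0, n_iter=100):
    """Iterate: linearize the concave part at w_k, then minimize the convex surrogate."""
    w = float(w0)
    history = [f(w)]
    for _ in range(n_iter):
        slope = -8.0 * w                           # derivative of the concave part at w_k
        # The surrogate w^4 + w + slope * w is convex; its minimizer solves 4 w^3 + 1 + slope = 0.
        w = float(np.cbrt(-(1.0 + slope) / 4.0))
        history.append(f(w))
    return w, history

w_cccp, values = cccp(w0=1.0)
```

Each iteration decreases the objective, but started from w0 = 1.0 the procedure settles in the local minimum on the positive side, although this toy function has a lower minimum at negative w. This mirrors the local-minimum behaviour reported for the Two Moons data.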

16 S³VM as Unconstrained Differentiable Optimization Problem
[Plots: original loss functions $\ell_l$, $\ell_u$; smooth loss functions $\ell_l$, $\ell_u$]

17 ∇S³VM [Chapelle, Zien; AISTATS 2005]
simply do gradient descent!
thereby stepwise increase $C^*$
contS³VM [Chapelle et al.; ICML 2006]
... in more detail on next slides!
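Gradient descent on a smoothed objective with a stepwise increasing weight on the unlabeled term could look like the sketch below. The specific smooth losses are assumptions, since the slides show them only as plots: a squared hinge for labeled points and exp(-s t^2) with s = 3 for unlabeled points. The offset b is held fixed, and the schedule, step size, and toy data are likewise illustrative.

```python
import numpy as np

def objective_and_grad(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=1.0, s=3.0):
    """Smooth S3VM objective and its gradient with respect to w (b is held fixed)."""
    t_lab = y_lab * (X_lab @ w + b)
    t_unl = X_unl @ w + b
    slack = np.maximum(1.0 - t_lab, 0.0)           # squared hinge uses slack**2
    e = np.exp(-s * t_unl**2)                      # smooth unlabeled loss
    value = 0.5 * w @ w + C * np.sum(slack**2) + C_star * np.sum(e)
    grad = (w
            - 2.0 * C * (slack * y_lab) @ X_lab
            - 2.0 * s * C_star * (t_unl * e) @ X_unl)
    return value, grad

def gradient_descent(w0, b, X_lab, y_lab, X_unl, C_star_schedule, lr=0.005, n_steps=300):
    """Plain gradient descent, restarted from the previous solution for each C_star."""
    w = w0.copy()
    for C_star in C_star_schedule:
        for _ in range(n_steps):
            _, g = objective_and_grad(w, b, X_lab, y_lab, X_unl, C_star=C_star)
            w -= lr * g
    return w

rng = np.random.default_rng(0)
X_lab = np.array([[2.0, 1.0], [-2.0, -1.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.vstack([rng.normal([2.0, 1.0], 0.3, size=(5, 2)),
                   rng.normal([-2.0, -1.0], 0.3, size=(5, 2))])
w_final = gradient_descent(np.zeros(2), 0.0, X_lab, y_lab, X_unl,
                           C_star_schedule=[0.01, 0.1, 1.0])
```

Starting with a small weight on the unlabeled term lets the labeled points determine the initial direction before the non-convex term gains influence.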

18 Hard Balancing Constraint
S³VMlight constraint:
$$\frac{\#\{y_j = +1\}}{\#\text{ unlabeled points}} = \frac{\#\{y_i = +1\}}{\#\text{ labeled points}}$$
equivalent constraint:
$$\underbrace{\frac{1}{m} \sum_j \operatorname{sign}(w^\top x_j + b)}_{\text{average prediction}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$

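19 Making the Balancing Constraint Linear
hard / non-linear:
$$\underbrace{\frac{1}{m} \sum_j \operatorname{sign}(w^\top x_j + b)}_{\text{average prediction}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$
soft / linear:
$$\underbrace{\frac{1}{m} \sum_j (w^\top x_j + b)}_{\text{mean output on unlabeled points}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$
Implementing the linear soft balancing:
center the unlabeled data: $\sum_j x_j = 0$
the mean output on unlabeled points then equals $b$, so just fix $b$ to the average label; unconstrained optimization over $w$!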
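The centering trick can be verified numerically: once the unlabeled inputs sum to zero, the mean output on them equals b for every weight vector, so fixing b to the average label satisfies the linear constraint. The toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_unl = rng.normal(loc=3.0, size=(50, 4))       # unlabeled inputs (toy data)
y_lab = np.array([1, 1, -1, -1, -1])            # labels of the labeled points

center = X_unl.mean(axis=0)
X_unl_centered = X_unl - center                 # now sum_j x_j = 0
# The labeled inputs would need to be shifted by the same `center`.
b = y_lab.mean()                                # fix b to the average label

w = rng.normal(size=4)                          # an arbitrary weight vector
mean_output = np.mean(X_unl_centered @ w + b)   # equals b regardless of w
```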

20 The Continuation Method in a Nutshell
Procedure:
1. smooth function until convex
2. find minimum
3. track minimum while decreasing amount of smoothing
[Illustration]

21 Smoothing the S³VM Objective $f(\cdot)$
Convolution of $f(\cdot)$ with a Gaussian of variance $\gamma/2$:
$$f_\gamma(w) = (\pi\gamma)^{-d/2} \int f(w - t)\, \exp(-\|t\|^2/\gamma)\, dt$$
Closed form solution!
Smoothing Sequence
choose $\gamma_0 > \gamma_1 > \dots > \gamma_{p-1} > \gamma_p = 0$
choose $\gamma_0$ such that $f_{\gamma_0}(\cdot)$ is convex
choose $\gamma_{p-1}$ such that $f_{\gamma_{p-1}}(\cdot) \approx f_{\gamma_p}(\cdot) = f(\cdot)$
$p = 10$ steps (equidistant on log scale) sufficient
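The continuation procedure and the smoothing sequence can be tried out on a one-dimensional toy problem. In this sketch the objective f(w) = w^4 - 4 w^2 + w, the end points of the gamma sequence, and the use of Gauss-Hermite quadrature with a numerical gradient are assumptions for illustration; the S³VM objective itself has a closed-form smoothing instead.

```python
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(20)   # quadrature for the weight exp(-u^2)

def f(w):
    return w**4 - 4.0 * w**2 + w                       # toy non-convex objective

def f_smooth(w, gamma):
    """(pi*gamma)^(-1/2) * integral of f(w - t) exp(-t^2 / gamma) dt, via t = sqrt(gamma) * u."""
    return np.sum(weights * f(w - np.sqrt(gamma) * nodes)) / np.sqrt(np.pi)

def descend(w, gamma, lr=0.01, n_steps=500, h=1e-5):
    """Gradient descent on the smoothed function, using a central-difference gradient."""
    for _ in range(n_steps):
        grad = (f_smooth(w + h, gamma) - f_smooth(w - h, gamma)) / (2.0 * h)
        w -= lr * grad
    return w

# p = 10: gamma_0 > ... > gamma_9 equidistant on a log scale, then gamma_10 = 0.
gammas = np.append(np.geomspace(2.0, 0.01, 10), 0.0)

w_cont = 0.0
for gamma in gammas:                                   # track the minimum as smoothing decreases
    w_cont = descend(w_cont, gamma)

w_plain = descend(1.0, 0.0)                            # plain descent on f from w = 1
```

For this toy function the smoothed version is w^4 + (3 gamma - 4) w^2 + w plus a constant, which is convex for gamma = 2. Tracking its minimum leads to the lower of the two minima of f, while plain descent started at w = 1 stays in the higher one.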

22 Handling Non-Linearity
Consider non-linear map $\Phi(x)$, kernel $k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$.
Representer Theorem: S³VM solution is in span $E$ of data points, $E := \operatorname{span}\{\Phi(x_i)\} \cong \mathbb{R}^{n+m}$
Implementation
1. expand basis vectors $v_i$ of $E$: $v_i = \sum_k A_{ik} \Phi(x_k)$
2. orthonormality gives: $(A^\top A)^{-1} = K$; solve for $A$, e.g. by KPCA or Cholesky
3. project data $\Phi(x_i)$ on basis $V = (v_j)_j$: $\tilde{x}_i = V^\top \Phi(x_i) = (A^{-1})_i$
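The Cholesky variant of the implementation steps can be sketched with NumPy. With K = L L^T, the choice A = L^{-1} satisfies the orthonormality condition, and the rows of A^{-1} = L serve as explicit coordinates whose inner products reproduce the kernel. The RBF kernel, its width, the jitter term, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                     # all n + m input points (toy data)

sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq_dists)                     # RBF kernel matrix (assumed kernel)
K += 1e-10 * np.eye(len(X))                     # small jitter for numerical stability

L = np.linalg.cholesky(K)                       # K = L L^T with L lower triangular
A = np.linalg.inv(L)                            # one valid solution of (A^T A)^{-1} = K
X_tilde = L                                     # row i is the new representation (A^{-1})_i
```

A linear S³VM trained on the rows of `X_tilde` then corresponds to a kernel S³VM on the original points.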

23 Comparison of S³VM Optimization Methods
averaged over splits (and pairs of classes)
fixed hyperparameters (close to hard margin)
similar results for other hyperparameter settings
[Chapelle, Chi, Zien; ICML 2006]

24 Why would unlabeled data be useful at all? Uniform data do not help.

26 Cluster Assumption
Points in the same cluster are likely to be of the same class.
Algorithmic idea: Low Density Separation

27 Manifold Assumption
The data lie on (close to) a low-dimensional manifold.
[images from "The Geometric Basis of Semi-Supervised Learning", Sindhwani, Belkin, Niyogi, in "Semi-Supervised Learning", Chapelle, Schölkopf, Zien]
Algorithmic idea: use Nearest-Neighbor Graph

28 Assumption: Independent Views Exist
There exist subsets of features, called views, each of which
is independent of the others given the class;
is sufficient for classification.
[Figure: view 1, view 2]
Algorithmic idea: Co-Training

29 Assumption | Approach | Example Algorithm
Cluster Assumption | Low Density Separation | S³VM; Entropy Regularization; Data-Dependent Regularization; ...
Manifold Assumption | Graph-based Methods | build weighted graph $(w_{kl})$; $\min_{(y_j)} \sum_k \sum_l w_{kl} (y_k - y_l)^2$; relax $y_j$ to be real ⇒ QP
Independent Views | Co-Training | train two predictors $y_j^{(1)}$, $y_j^{(2)}$; couple objectives by adding $\sum_j \big(y_j^{(1)} - y_j^{(2)}\big)^2$
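The graph-based row can be made concrete on a tiny example. With the $y_j$ relaxed to real values and, as an added assumption, the labeled values clamped to their labels, minimizing the sum of weighted squared differences reduces to a linear system in the graph Laplacian. The chain graph and its unit weights are illustrative.

```python
import numpy as np

# Chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights w_kl.
W = np.zeros((5, 5))
for k in range(4):
    W[k, k + 1] = W[k + 1, k] = 1.0

labeled = np.array([0, 4])
unlabeled = np.array([1, 2, 3])
y_labeled = np.array([1.0, -1.0])               # end points carry the labels

laplacian = np.diag(W.sum(axis=1)) - W
L_uu = laplacian[np.ix_(unlabeled, unlabeled)]
L_ul = laplacian[np.ix_(unlabeled, labeled)]

# Setting the gradient of sum_kl w_kl (y_k - y_l)^2 w.r.t. the unlabeled y_j to zero:
y_unlabeled = np.linalg.solve(L_uu, -L_ul @ y_labeled)
```

On this chain the solution interpolates linearly between the two labeled end points, so the predicted sign changes in the middle of the graph.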

30 Discriminative Learning (Diagnostic Paradigm)
model $p(y|x)$ (or just boundary: $\{x \mid p(y|x) = \tfrac{1}{2}\}$)
examples: S³VM, graph-based methods
Generative Learning (Sampling Paradigm)
model $p(x|y)$
predict via Bayes: $p(y|x) = \dfrac{p(y)\, p(x|y)}{\sum_{y'} p(y')\, p(x|y')}$
missing data problem: EM algorithm (expectation-maximization) is a natural tool
successful for text data [Nigam et al.; Machine Learning, 2000]

31 SSL Book
MIT Press, Sept
edited by B. Schölkopf, O. Chapelle, A. Zien
contains many state-of-the-art algorithms by top researchers
extensive SSL benchmark
online material: sample chapters; benchmark data; more information

32 SSL Book: Text Benchmark
[Table: error [%] and AUC [%], each for l=10 and l=100, for the methods 1-NN; SVM; MVU + 1-NN; LEM + 1-NN; QC + CMN; Discrete Reg.; TSVM; SGT; Cluster-Kernel; LDS; Laplacian RLS]

33 Combining S³VM with Graph-based Regularizer
LapSVM [1]: modify kernel using graph, then train SVM
combination with S³VM even better [2]
[Figure: MNIST, 3 vs 5]
[1] "Beyond the Point Cloud"; Sindhwani, Niyogi, Belkin; ICML 2005
[2] "A Continuation Method for S³VM"; Chapelle, Chi, Zien; ICML 2006
Combining S³VM with Co-Training
"SSL for Structured Output Variables"; Brefeld, Scheffer; ICML 2006

34 How to set $C$?
data fitting, $y_i\, w^\top x_i \ge 1$, and regularization, $\min \|w\|^2$: $w^\top x_i = O(1) \Rightarrow \|w\|^2 \approx \mathrm{Var}[x]^{-1}$
balance influence: $\|w\|^2 \approx C \xi_i \Rightarrow C \approx \mathrm{Var}[x]^{-1}$
How to set $C^*$?
$C^* = C$
$C^* \cdot \#\text{ unlabeled points} = \lambda \cdot \#\text{ labeled points} \cdot C$

35 Naive Application
Transductive setting on each user/inbox:
use inbox of given user as unlabeled data
test data = unlabeled data
Guess the model:
$\mathrm{Var}[x] \approx 1$, so set $C = 1$
$C^* = C$
linear kernel
Results: AUC (rank) [rank in unofficial list]
method        task A            task B
S³VMlight     94.53% (4) [6]    92.34% (2) [4]
∇S³VM         96.72% (1) [3]    93.74% (2) [4]
contS³VM      96.01% (1) [3]    93.56% (2) [4]
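The results above are reported as AUC, which can be computed from classifier outputs as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as one half. A minimal sketch with made-up scores; the function name `auc` and the data are assumptions and unrelated to the challenge results.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]          # every positive score minus every negative score
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.4, 0.35, 0.1])   # hypothetical outputs w.x + b
labels = np.array([1, -1, 1, -1, -1])           # +1 = spam, -1 = non-spam (assumed coding)
value = auc(scores, labels)
```

Because AUC depends only on the ranking of the outputs, it is unaffected by the choice of the offset b.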

36 Model selection:
$C \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{+1}, \dots\}$
$C^* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{+1}, \dots\} \cdot C$
cross-validation (3-fold for task A; 5-fold for task B)
Results: AUC for contS³VM
                                 task A    task B
$C^* = C = 1$ (guessed model)    96.01%    93.56%
model selection                  89.31%    90.09%
significant drop in accuracy!
CV relies on the iid assumption: that the data are independent and identically distributed

37 Take Home Messages
S³VM implements low density separation (margin maximization)
optimization technique matters (non-convex objective)
works well for text classification (texts form clusters)
S³VM-based hybrids may be even better
for spam filtering, further methods needed to cope with non-iid situation (mail inboxes)!
Thank you!
