Semi-Supervised Support Vector Machines and Application to Spam Filtering




Semi-Supervised Support Vector Machines and Application to Spam Filtering
Alexander Zien, Empirical Inference Department (Bernhard Schölkopf), Max Planck Institute for Biological Cybernetics
ECML 2006 Discovery Challenge

1. Introduction
2. Training an S³VM: Why It Matters; Some S³VM Training Methods; Gradient-based Optimization; The Continuation S³VM
3. Overview of SSL: Assumptions of SSL; A Crude Overview of SSL; Combining Methods
4. Application to Spam Filtering: Naive Application; Proper Model Selection
5. Conclusions

find a linear classification boundary

not robust wrt input noise!

SVM: maximum margin classifier

$$\min_{w,b}\ \underbrace{\tfrac{1}{2}\, w^\top w}_{\text{regularizer}} \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1$$

S³VM (TSVM): semi-supervised (transductive) SVM

$$\min_{w,b,(y_j)}\ \underbrace{\tfrac{1}{2}\, w^\top w}_{\text{regularizer}} \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1, \quad y_j\,(w^\top x_j + b) \ge 1$$

(indices $i$ range over labeled points, indices $j$ over unlabeled points; the unknown labels $y_j$ are optimization variables)

soft margin S³VM

$$\min_{w,b,(y_j),(\xi_k)}\ \tfrac{1}{2}\, w^\top w + C \sum_i \xi_i + C^* \sum_j \xi_j$$
$$\text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \qquad y_j\,(w^\top x_j + b) \ge 1 - \xi_j,\ \ \xi_j \ge 0$$

(labeled slacks are weighted by $C$, unlabeled slacks by $C^*$)

Two Moons toy data: easy for humans (0% error), hard for S³VMs!

S³VM optimization method      | test error | objective value
Branch & Bound (global min.)  |   0.0%     |  7.81
CCCP (local min.)             |  64.0%     | 39.55
S³VM^light (local min.)       |  66.2%     | 20.94
∇S³VM (local min.)            |  59.3%     | 13.64
cS³VM (local min.)            |  45.7%     | 13.25

The objective function is good for SSL; try to find better local minima!

[soft-margin S³VM problem, as above]

Mixed Integer Programming [Bennett, Demiriz; NIPS 1998]
- global optimum found by standard optimization packages (e.g. CPLEX)
- combinatorial & NP-hard!
- only works for small-sized problems

[soft-margin S³VM problem, as above]

S³VM^light [T. Joachims; ICML 1999]
- train an SVM on the labeled points, predict the y_j's
- in prediction, always make sure that
  #{y_j = +1} / # unlabeled points = #{y_i = +1} / # labeled points   (*)
- with stepwise increasing C* do:
  1. train an SVM on all points, using the labels (y_i), (y_j)
  2. predict new y_j's subject to the balancing constraint (*)
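To make the loop concrete, here is a minimal Python sketch (not the authors' code): it assumes scikit-learn's LinearSVC as the inner SVM solver, enforces the balancing constraint by thresholding decision values, and re-trains with a stepwise increasing weight on the unlabeled points. The label-switching heuristic of the real S³VM^light is omitted, and all names and the weighting scheme are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumed inner solver; any SVM would do

def s3vm_light_sketch(X_l, y_l, X_u, C=1.0, C_star_final=1.0, n_steps=5):
    """Rough sketch of the S3VM^light idea (Joachims, 1999): self-train with a
    stepwise increasing weight C* on the unlabeled points, keeping the fraction
    of predicted +1 labels equal to that on the labeled set."""
    pos_fraction = np.mean(y_l == 1)

    def balanced_labels(scores):
        # assign +1 to the highest-scoring unlabeled points so that
        # #{y_j = +1}/#unlabeled == #{y_i = +1}/#labeled  (constraint (*))
        n_pos = int(round(pos_fraction * len(scores)))
        n_pos = min(max(n_pos, 1), len(scores) - 1)  # keep both classes present
        y = -np.ones(len(scores))
        y[np.argsort(-scores)[:n_pos]] = 1
        return y

    svm = LinearSVC(C=C).fit(X_l, y_l)                 # SVM on labeled data only
    y_u = balanced_labels(svm.decision_function(X_u))  # initial guesses for y_j

    X_all = np.vstack([X_l, X_u])
    for C_star in np.linspace(C_star_final / n_steps, C_star_final, n_steps):
        # per-sample weights emulate the separate costs C (labeled) and C* (unlabeled)
        sw = np.r_[np.full(len(y_l), C), np.full(len(y_u), C_star)]
        svm = LinearSVC(C=1.0).fit(X_all, np.r_[y_l, y_u], sample_weight=sw)
        y_u = balanced_labels(svm.decision_function(X_u))
    return svm, y_u
```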

[soft-margin S³VM problem, as above]

The balancing constraint is required to avoid degenerate solutions!

[soft-margin S³VM problem, as above]

Effective Loss Functions

At the optimum the slacks equal the losses:
$$\xi_i = \max\{\,1 - y_i\,(w^\top x_i + b),\ 0\,\}, \qquad \xi_j = \min_{y_j \in \{+1,-1\}} \max\{\,1 - y_j\,(w^\top x_j + b),\ 0\,\} = \max\{\,1 - |w^\top x_j + b|,\ 0\,\}$$

[Plots: hinge loss $\xi_i$ as a function of $y_i(w^\top x_i + b)$; symmetric hinge loss $\xi_j$ as a function of $w^\top x_j + b$]

Resolving the Constraints

$$\tfrac{1}{2}\, w^\top w + C \sum_i \ell_l\!\big(y_i\,(w^\top x_i + b)\big) + C^* \sum_j \ell_u\!\big(w^\top x_j + b\big)$$

[Plots: labeled loss $\ell_l$ (hinge) and unlabeled loss $\ell_u$ (symmetric hinge)]
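For reference, the unconstrained objective above is only a few lines of NumPy. This is a minimal sketch (not from the slides), using the hinge loss for labeled points and the symmetric hinge for unlabeled points:

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, C, C_star):
    """Unconstrained S3VM objective: regularizer + hinge loss on labeled points
    + symmetric hinge loss on unlabeled points (a minimal sketch)."""
    hinge = np.maximum(1.0 - y_l * (X_l @ w + b), 0.0)       # l_l on labeled points
    sym_hinge = np.maximum(1.0 - np.abs(X_u @ w + b), 0.0)   # l_u on unlabeled points
    return 0.5 * w @ w + C * hinge.sum() + C_star * sym_hinge.sum()
```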

[unconstrained S³VM objective, as above]

CCCP-S³VM [R. Collobert et al.; ICML 2006]
- CCCP: Concave-Convex Procedure
- objective = convex function + concave function
- starting from the SVM solution, iterate:
  1. approximate the concave part by a linear function at the current point
  2. solve the resulting convex problem
- [Fung, Mangasarian; 1999]: similar approach, restricted to linear S³VMs

[unconstrained S³VM objective, as above]

S³VM as an Unconstrained Differentiable Optimization Problem
- replace the original (non-differentiable) loss functions $\ell_l$, $\ell_u$ by smooth approximations
[Plots: original vs. smoothed loss functions]

[unconstrained S³VM objective, as above]

∇S³VM [Chapelle, Zien; AISTATS 2005]
- simply do gradient descent!
- thereby stepwise increase C*

contS³VM [Chapelle et al.; ICML 2006]
- ... in more detail on the next slides!
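The following sketch shows plain gradient descent on a smoothed S³VM objective. The particular smooth surrogates (squared hinge for labeled points, a Gaussian-shaped loss exp(-s t²) for unlabeled points) and all parameter values are assumptions for illustration, not the authors' exact choices; the stepwise increase of C* is left out for brevity.

```python
import numpy as np

def smoothed_s3vm_gradient_descent(X_l, y_l, X_u, C=1.0, C_star=1.0,
                                   s=3.0, lr=1e-2, n_iter=500):
    """Gradient descent on a smoothed S3VM objective (sketch):
    0.5*||w||^2 + C*sum max(0, 1-m_i)^2 + C_star*sum exp(-s*t_j^2),
    with m_i = y_i(w.x_i + b) and t_j = w.x_j + b."""
    d = X_l.shape[1]
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        m = y_l * (X_l @ w + b)                   # labeled margins
        t = X_u @ w + b                           # unlabeled outputs
        g_m = -2.0 * C * np.maximum(1.0 - m, 0.0) * y_l      # dLoss/d(w.x_i + b)
        g_t = -2.0 * s * C_star * t * np.exp(-s * t * t)     # dLoss/d(w.x_j + b)
        grad_w = w + X_l.T @ g_m + X_u.T @ g_t
        grad_b = g_m.sum() + g_t.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```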

[unconstrained S³VM objective, as above]

Hard Balancing Constraint

S³VM^light constraint:
$$\frac{\#\{y_j = +1\}}{\#\text{ unlabeled points}} = \frac{\#\{y_i = +1\}}{\#\text{ labeled points}}$$

equivalent constraint:
$$\underbrace{\frac{1}{m} \sum_j \operatorname{sign}\!\big(w^\top x_j + b\big)}_{\text{average prediction}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$

Making the Balancing Constraint Linear

hard / non-linear:
$$\underbrace{\frac{1}{m} \sum_j \operatorname{sign}\!\big(w^\top x_j + b\big)}_{\text{average prediction}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$

soft / linear:
$$\underbrace{\frac{1}{m} \sum_j \big(w^\top x_j + b\big)}_{\text{mean output on unlabeled points}} = \underbrace{\frac{1}{n} \sum_i y_i}_{\text{average label}}$$

Implementing the linear soft balancing:
- center the unlabeled data: $\sum_j x_j = 0$
- then just fix $b$; unconstrained optimization over $w$!
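A minimal sketch of that implementation trick: once the unlabeled points have zero mean, the mean output on the unlabeled set equals b, so the linear balancing constraint is satisfied by simply fixing b to the average label (function and variable names are illustrative).

```python
import numpy as np

def center_and_fix_b(X_l, y_l, X_u):
    """Implement the linear (soft) balancing constraint: shift all inputs so the
    unlabeled points have zero mean, then fix b to the average label.
    Optimization is then unconstrained over w only."""
    mu = X_u.mean(axis=0)
    X_l_c, X_u_c = X_l - mu, X_u - mu   # now sum_j x_j = 0 over the unlabeled data
    b = y_l.mean()                      # mean output on unlabeled points = b = average label
    return X_l_c, X_u_c, b
```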

The Continuation Method in a Nutshell

Procedure:
1. smooth the function until it is convex
2. find the minimum
3. track the minimum while decreasing the amount of smoothing

[Illustration]

Smoothing the S³VM Objective $f(\cdot)$

Convolution of $f(\cdot)$ with a Gaussian of width $\gamma/2$:
$$f_\gamma(w) = (\pi\gamma)^{-d/2} \int f(w - t)\, \exp\!\big(-\|t\|^2/\gamma\big)\, dt$$
Closed-form solution!

Smoothing Sequence
- choose $\gamma_0 > \gamma_1 > \dots > \gamma_{p-1} > \gamma_p = 0$
- choose $\gamma_0$ such that $f_{\gamma_0}(\cdot)$ is convex
- choose $\gamma_{p-1}$ such that $f_{\gamma_{p-1}}(\cdot) \approx f_{\gamma_p}(\cdot) = f(\cdot)$
- $p = 10$ steps (equidistant on a log scale) are sufficient
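The continuation loop itself is short. In this sketch, `minimize_smoothed(w_init, gamma)` is a hypothetical helper standing in for any minimizer of the gamma-smoothed objective (e.g. the gradient descent above); the schedule and the number of steps are taken from the slide, the rest is an assumption.

```python
import numpy as np

def continuation(minimize_smoothed, w0, gamma0, p=10):
    """Continuation method sketch: start with heavy smoothing gamma0 (chosen so the
    smoothed objective is convex), minimize, then warm-start while the smoothing is
    reduced on a log scale, and finish on the un-smoothed objective (gamma = 0)."""
    gammas = np.logspace(np.log10(gamma0), np.log10(gamma0) - (p - 1), p)  # decreasing
    w = w0
    for gamma in gammas:
        w = minimize_smoothed(w, gamma)   # track the minimum as smoothing decreases
    return minimize_smoothed(w, 0.0)      # final minimization of the original objective
```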

Handling Non-Linearity

Consider a non-linear map $\Phi(x)$ with kernel $k(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$.

Representer Theorem: the S³VM solution lies in the span $E$ of the data points,
$$E := \operatorname{span}\{\Phi(x_i)\} \cong \mathbb{R}^{n+m}$$

Implementation
1. expand basis vectors $v_i$ of $E$: $v_i = \sum_k A_{ik}\, \Phi(x_k)$
2. orthonormality gives $(A^\top A)^{-1} = K$; solve for $A$, e.g. by KPCA or a Cholesky factorization
3. project the data $\Phi(x_i)$ onto the basis $V = (v_j)_j$: $\tilde{x}_i = V^\top \Phi(x_i) = A\, K_{\cdot i}$
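With the Cholesky route this reduces to a couple of lines: if K = L Lᵀ, the rows of L are explicit coordinates whose inner products reproduce the kernel, so the linear S³VM machinery above can be reused unchanged. A minimal sketch, assuming K is positive definite (add a small ridge otherwise); the KPCA alternative mentioned on the slide is not shown.

```python
import numpy as np

def kernel_representation(K):
    """Turn a kernel matrix K into explicit (n+m)-dimensional coordinates.
    With K = L L^T (Cholesky), the i-th row of L serves as the new input x~_i,
    since <x~_i, x~_k> = (L L^T)_{ik} = k(x_i, x_k)."""
    L = np.linalg.cholesky(K)   # lower-triangular factor, K = L L^T
    return L                    # rows of L are the new data points
```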

Comparison of S³VM Optimization Methods
- averaged over splits (and pairs of classes)
- fixed hyperparameters (close to hard margin)
- similar results for other hyperparameter settings
[Chapelle, Chi, Zien; ICML 2006]

Why would unlabeled data be useful at all? Uniform data do not help.


Cluster Assumption: points in the same cluster are likely to be of the same class.
Algorithmic idea: Low Density Separation

Manifold Assumption: the data lie on (or close to) a low-dimensional manifold.
[Images from "The Geometric Basis of Semi-Supervised Learning", Sindhwani, Belkin, Niyogi, in "Semi-Supervised Learning", Chapelle, Schölkopf, Zien (eds.)]
Algorithmic idea: use a Nearest-Neighbor Graph

Assumption: Independent Views Exist
There exist subsets of features, called views, each of which is independent of the others given the class and is sufficient for classification.
[Illustration: data plotted in view 1 vs. view 2]
Algorithmic idea: Co-Training

Assumption          | Approach                | Example Algorithm
Cluster Assumption  | Low Density Separation  | S³VM; Entropy Regularization; Data-Dependent Regularization; ...
Manifold Assumption | Graph-based Methods     | build a weighted graph (w_kl); min over (y_j) of sum_{k,l} w_kl (y_k - y_l)^2; relax y_j to be real, giving a QP
Independent Views   | Co-Training             | train two predictors y_j^(1), y_j^(2); couple the objectives by adding sum_j (y_j^(1) - y_j^(2))^2
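The graph-based entry in the table above can be made concrete in a few lines: with the labeled values clamped, minimizing sum_{k,l} w_kl (y_k - y_l)^2 over the relaxed real-valued labels of the unlabeled points amounts to solving a linear system in the graph Laplacian. A minimal sketch (the weight matrix W and the index bookkeeping are illustrative):

```python
import numpy as np

def graph_ssl(W, y_l, labeled_idx, unlabeled_idx):
    """Graph-based SSL: minimize sum_{k,l} w_kl (y_k - y_l)^2 over real-valued
    labels of the unlabeled points, with labeled values clamped to y_l.
    W is a symmetric weight matrix over all points; y_l is ordered like labeled_idx."""
    D = np.diag(W.sum(axis=1))
    Lap = D - W                                   # graph Laplacian
    L_uu = Lap[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    y_u = np.linalg.solve(L_uu, W_ul @ y_l)       # harmonic (relaxed) solution
    return np.sign(y_u)                           # threshold back to {-1, +1}
```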

Discriminative Learning (Diagnostic Paradigm)
- model $p(y\,|\,x)$ (or just the boundary: $\{\,x : p(y\,|\,x) = \tfrac{1}{2}\,\}$)
- examples: S³VM, graph-based methods

Generative Learning (Sampling Paradigm)
- model $p(x\,|\,y)$; predict via Bayes: $p(y\,|\,x) = \dfrac{p(y)\,p(x\,|\,y)}{\sum_{y'} p(y')\,p(x\,|\,y')}$
- unlabeled data turn this into a missing-data problem; the EM algorithm (expectation-maximization) is a natural tool
- successful for text data [Nigam et al.; Machine Learning, 2000]
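As a small illustration of the sampling-paradigm prediction step (which is also the posterior computation an EM E-step would perform on unlabeled points), here is a minimal Bayes-rule sketch; the log-priors and log-likelihoods are assumed to come from whatever generative model is being fit:

```python
import numpy as np

def bayes_predict(log_prior, log_likelihood):
    """Generative prediction via Bayes' rule: p(y|x) proportional to p(y) p(x|y).
    log_prior has shape [K] (log p(y)); log_likelihood has shape [n, K] (log p(x|y))."""
    log_joint = log_likelihood + log_prior              # log p(y) + log p(x|y)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # subtract max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)       # normalized p(y|x)
```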

SSL Book
- MIT Press, Sept. 2006
- edited by B. Schölkopf, O. Chapelle, A. Zien
- contains many state-of-the-art algorithms by top researchers
- extensive SSL benchmark
- online material: sample chapters, benchmark data, more information
  http://www.kyb.tuebingen.mpg.de/ssl-book/

SSL Book Text Benchmark (l = number of labeled points)

Method         | error [%], l=10 | error [%], l=100 | AUC [%], l=10 | AUC [%], l=100
1-NN           | 38.12 | 30.11 | –     | –
SVM            | 45.37 | 26.45 | 67.97 | 84.26
MVU + 1-NN     | 45.32 | 32.83 | –     | –
LEM + 1-NN     | 39.44 | 30.77 | –     | –
QC + CMN       | 40.79 | 25.71 | 70.71 | 84.62
Discrete Reg.  | 40.37 | 24.00 | 53.79 | 71.53
TSVM           | 31.21 | 24.52 | 73.42 | 80.96
SGT            | 29.02 | 23.09 | 80.09 | 85.22
Cluster-Kernel | 42.72 | 24.38 | 73.09 | 85.90
LDS            | 27.15 | 23.15 | 80.68 | 84.77
Laplacian RLS  | 33.68 | 23.57 | 76.55 | 85.05

Combining S³VM with a Graph-based Regularizer
- LapSVM [1]: modify the kernel using the graph, then train an SVM
- the combination with S³VM is even better [2] (MNIST, 3 vs. 5)

[1] "Beyond the Point Cloud"; Sindhwani, Niyogi, Belkin; ICML 2005
[2] "A Continuation Method for S³VMs"; Chapelle, Chi, Zien; ICML 2006

Combining S³VM with Co-Training
- "SSL for Structured Output Variables"; Brefeld, Scheffer; ICML 2006

[soft-margin S³VM problem, as above]

How to set C? Balance data fitting, $y_i\, w^\top x_i \ge 1$, against regularization, $\min \|w\|^2$:
- $w^\top x_i = O(1) \;\Rightarrow\; \|w\|^2 \approx \mathrm{Var}[x]^{-1}$
- balance the influence of the two terms: $\|w\|^2 \approx C\, \xi_i \;\Rightarrow\; C \approx \mathrm{Var}[x]^{-1}$

How to set C*? Scale it so that the total weight of the unlabeled points is a fraction $\lambda$ of that of the labeled points:
$$C^* \cdot \#\text{unlabeled points} = \lambda \cdot C \cdot \#\text{labeled points}$$
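These heuristics fit in a few lines. A minimal sketch; the exact scaling conventions (and the role of lambda) are assumptions based on the reconstruction above:

```python
import numpy as np

def heuristic_Cs(X, n_labeled, n_unlabeled, lam=1.0):
    """Heuristic hyperparameter guesses: C ~ 1/Var[x] so that regularizer and
    data-fitting terms have comparable magnitude, and C* chosen so that
    C* * #unlabeled = lam * C * #labeled."""
    C = 1.0 / X.var()
    C_star = lam * C * n_labeled / n_unlabeled
    return C, C_star
```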

Naive Application: transductive setting on each user/inbox
- use the inbox of the given user as unlabeled data
- test data = unlabeled data

Guess the model: $\mathrm{Var}[x] \approx 1$, so set $C = 1$; $C^* = C$; linear kernel.

Results: AUC (rank) [rank in unofficial list]

Method      | task A          | task B
S³VM^light  | 94.53% (4) [6]  | 92.34% (2) [4]
∇S³VM       | 96.72% (1) [3]  | 93.74% (2) [4]
contS³VM    | 96.01% (1) [3]  | 93.56% (2) [4]

Model selection:
- $C \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{+1}, 10^{+2}\}$
- $C^* \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{+1}, 10^{+2}\} \cdot C$
- cross-validation (3-fold for task A; 5-fold for task B)

Results: AUC for contS³VM

                            | task A  | task B
C = C* = 1 (guessed model)  | 96.01%  | 93.56%
model selection             | 89.31%  | 90.09%

A significant drop in accuracy! CV relies on the i.i.d. assumption: that the data are independent and identically distributed.
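For completeness, the model-selection protocol looks roughly like the following sketch. Here `train_s3vm` is a hypothetical helper (standing in for any of the S³VM trainers above) that returns a decision function; only KFold and roc_auc_score are real scikit-learn calls, everything else is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def select_C_Cstar(X_l, y_l, X_u, train_s3vm, n_folds=3):
    """Grid search over C and C* (C* specified relative to C) by k-fold
    cross-validation on the labeled data, scored by AUC (a sketch)."""
    grid = [10.0 ** e for e in (-2, -1, 0, 1, 2)]
    best, best_auc = None, -np.inf
    for C in grid:
        for C_star in [g * C for g in grid]:
            aucs = []
            for tr, va in KFold(n_splits=n_folds, shuffle=True).split(X_l):
                f = train_s3vm(X_l[tr], y_l[tr], X_u, C, C_star)  # hypothetical trainer
                aucs.append(roc_auc_score(y_l[va], f(X_l[va])))
            if np.mean(aucs) > best_auc:
                best, best_auc = (C, C_star), np.mean(aucs)
    return best
```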

Take Home Messages
- S³VM implements low density separation (margin maximization)
- the optimization technique matters (non-convex objective)
- works well for text classification (texts form clusters)
- S³VM-based hybrids may be even better
- for spam filtering, further methods are needed to cope with the non-i.i.d. situation (mail inboxes)!

Thank you!