Graph-based Semi-supervised Learning Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ MLSS 2012 La Palma
Motivation
- Large amounts of unlabelled data, small amounts of labelled data
- Labelling/annotating data is expensive
- We want supervised learning methods that can use information in the input distribution
Example: Images
[figure: a collection of images, one labelled "cat", the rest unlabelled]
Classification using Unlabelled Data
Assumption: there is information in the data distribution.
[figure: a few labelled points (+) among many unlabelled points]
Outline
- Graph-based semi-supervised learning
- Active graph-based semi-supervised learning
- Some thoughts on Bayesian semi-supervised learning
Graph-based Semi-supervised Learning
Labeled and Unlabeled Data as a Graph
Idea: construct a graph connecting similar data points. Let the hidden/observed labels be random variables on the nodes of this graph (i.e. the graph is an MRF).
Intuition: similar data points have similar labels; information propagates from labeled data points; the graph encodes this intuition.
Work with Xiaojin Zhu (U Wisconsin) and John Lafferty (CMU)
The Graph
nodes: instances in $L \cup U$; binary labels $y \in \{0,1\}^n$
edges: local similarity; $n \times n$ symmetric weight matrix $W$ assumed given
energy: $E(y) = \frac{1}{2}\sum_{i,j} w_{ij}(y_i - y_j)^2$
[figure: a labelling where connected neighbours agree is "happy" (low energy); one where they disagree is "unhappy" (high energy)]
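To make the energy function concrete, here is a minimal sketch (Python/NumPy, not from the slides) with a small hypothetical 4-node weight matrix; a labelling where connected neighbours agree scores lower than one where they disagree.

```python
import numpy as np

def graph_energy(W: np.ndarray, y: np.ndarray) -> float:
    """E(y) = 1/2 * sum_ij w_ij (y_i - y_j)^2 for a labelling y."""
    diffs = y[:, None] - y[None, :]        # all pairwise differences y_i - y_j
    return 0.5 * float(np.sum(W * diffs ** 2))

# Hypothetical symmetric weight matrix for a 4-node graph.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(graph_energy(W, np.array([1.0, 1.0, 0.0, 0.0])))  # neighbours mostly agree: 2.0
print(graph_energy(W, np.array([1.0, 0.0, 0.0, 1.0])))  # neighbours disagree: 3.0
```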
Label Propagation
energy: $E(y) = \frac{1}{2}\sum_{i,j} w_{ij}(y_i - y_j)^2$
With no labelled data, $y = \mathbf{1}$ or $y = \mathbf{0}$ is a minimum-energy configuration: energy $= 0$.
Conditioned on labelled data, different completions of the labelling have different energies (e.g. 4, 2, and 1 in the figure); low-energy configurations propagate the observed labels along the edges.
Discrete Markov Random Fields
$E(y) = \frac{1}{2}\sum_{i,j} w_{ij}(y_i - y_j)^2$, $\quad p(y) \propto \exp(-E(y))$ with $y_L = L$ fixed, $\quad y_i \in \{0,1\}$
Graph mincut can find the minimum-energy (MAP) configuration. Problems: computing the probabilities is expensive, the multi-class case is also hard to compute, and learning $W$ is very hard. [Zhu & Ghahramani 02]; see also [Blum and Chawla 01]
We relax this to a Gaussian random field.
Discrete Markov Random Fields, revisited
$p(y) \propto \exp(-E(y))$ with $y_L = L$ fixed, $\quad y_i \in \{0,1\}$
Gaussian Random Fields
$p(y) \propto \exp(-E(y))$ with $y_L = L$ fixed, $\quad y_i \in \mathbb{R}$
Gaussian Random Fields
$p(y) \propto \exp(-E(y))\big|_{y_L=L} = \exp\!\Big(-\tfrac{1}{2}\sum_{i,j} w_{ij}(y_i - y_j)^2\Big)\Big|_{y_L=L} = \exp\!\big(-y^\top \Delta\, y\big)\Big|_{y_L=L}$
where $W = (w_{ij})$ is the $n \times n$ weight matrix, $D = \mathrm{diag}(w_1, \ldots, w_n)$ with $w_i = \sum_j w_{ij}$, and the Laplacian is
$\Delta = D - W = \begin{bmatrix} \Delta_{LL} & \Delta_{LU} \\ \Delta_{UL} & \Delta_{UU} \end{bmatrix}$
The Laplacian
$\Delta = D - W$ is the combinatorial or graph Laplacian, where $W$ is the weight matrix and $D = \mathrm{diag}(w_1, \ldots, w_n)$ with $w_i = \sum_j w_{ij}$; partitioned over labelled and unlabelled nodes, $\Delta = \begin{bmatrix} \Delta_{LL} & \Delta_{LU} \\ \Delta_{UL} & \Delta_{UU} \end{bmatrix}$.
The graph Laplacian plays the same role on graphs as the Laplace operator in other spaces. For example, in a Cartesian coordinate system, the Laplacian of a function $f$ is the sum of its second partial derivatives: $\Delta f = \nabla^2 f = \sum_i \frac{\partial^2 f}{\partial x_i^2}$.
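As a sanity check (a sketch, not from the slides), one can verify numerically that the quadratic form $y^\top \Delta\, y$ equals the pairwise energy, reusing the hypothetical weight matrix above:

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))    # diagonal degree matrix
L = D - W                     # combinatorial (graph) Laplacian

y = np.array([1.0, 1.0, 0.0, 0.0])
quadratic = y @ L @ y
energy = 0.5 * float(np.sum(W * (y[:, None] - y[None, :]) ** 2))
assert np.isclose(quadratic, energy)   # y' Delta y == E(y)
```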
Gaussian Random Fields
$p(y) \propto \exp(-E(y))\big|_{y_L=L} = \exp\!\Big(-\tfrac{1}{2}\sum_{i,j} w_{ij}(y_i - y_j)^2\Big)\Big|_{y_L=L} = \exp\!\big(-y^\top \Delta\, y\big)\Big|_{y_L=L}$
The distribution of $y_U$ given $y_L$ is Gaussian: $y_U \sim \mathcal{N}\!\big(f_U,\ \tfrac{1}{2}(\Delta_{UU})^{-1}\big)$
The mean is $f_U = -(\Delta_{UU})^{-1}\,\Delta_{UL}\, y_L$
The Mean $f_U$
The mean $f_U$ is:
- the mode of the Gaussian random field
- the minimum-energy state
- soft labels, and unique
- harmonic: $(\Delta f)_i = 0$, i.e. $f_i = \frac{\sum_{j \sim i} w_{ij} f_j}{\sum_{j \sim i} w_{ij}}$ for $i \in U$, so $0 < f_i < 1$
Related to heat kernels etc. in spectral graph theory.
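The harmonic property suggests a simple fixed-point scheme: clamp the labelled nodes and repeatedly replace each unlabelled value by the weighted average of its neighbours. A minimal sketch (plain NumPy; `labeled` is a hypothetical index array of labelled nodes, and every node is assumed to have at least one neighbour):

```python
import numpy as np

def harmonic_by_propagation(W, y_L, labeled, n_iter=1000):
    """Iterate f <- P f with labelled nodes clamped; on a connected graph
    this converges to the unique harmonic solution."""
    n = W.shape[0]
    f = np.zeros(n)
    f[labeled] = y_L
    P = W / W.sum(axis=1, keepdims=True)   # random-walk transition matrix
    for _ in range(n_iter):
        f = P @ f                          # each node averages its neighbours
        f[labeled] = y_L                   # re-clamp the observed labels
    return f
```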
$f_U$ Interpretation: Random Walks
$P(j \mid i) = \frac{w_{ij}}{\sum_k w_{ik}}$; $\quad f_i = P(\text{a random walk from } i \text{ reaches a node labelled } 1 \text{ before a node labelled } 0)$
[figure: a walk from node $i$ absorbed at a 1-labelled or a 0-labelled node]
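The random-walk picture can be checked by simulation; a sketch under stated assumptions (`labels` holds 0/1 for labelled nodes and None for unlabelled ones), estimating $f_i$ as the fraction of walks from $i$ absorbed at a 1-labelled node:

```python
import numpy as np

def absorption_estimate(W, labels, i, n_walks=2000, seed=0):
    """Monte Carlo estimate of f_i = P(walk from i hits a 1-label first)."""
    rng = np.random.default_rng(seed)
    P = W / W.sum(axis=1, keepdims=True)   # P(j | i) = w_ij / sum_k w_ik
    hits = 0
    for _ in range(n_walks):
        node = i
        while labels[node] is None:        # walk until absorbed at a label
            node = rng.choice(len(P), p=P[node])
        hits += labels[node]
    return hits / n_walks
```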
$f_U$ Interpretation: Electric Networks
$f_i = \mathrm{volt}(i)$, with resistors $R_{ij} = \frac{1}{w_{ij}}$ between nodes, $+1$ volt applied at the 1-labelled nodes and $0$ volts at the 0-labelled nodes.
[figure: the graph as a resistor network with a $+1$ volt source]
Classification
Naive: threshold $f_U$ at 0.5. But classes are often unbalanced.
Incorporating class priors (heuristic), e.g. prior: 90% class 1:
minimize $E(y) = y^\top \Delta\, y$ subject to $y_L = L$ and $\frac{1}{|U|}\sum_{i \in U} f_i = 0.9$
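In the experiments below this heuristic appears as CMN (class mass normalization): rescale the soft labels so the total class masses match the prior before thresholding. A sketch under that reading (NumPy; `q` is the assumed class-1 prior):

```python
import numpy as np

def cmn_labels(f_U, q=0.9):
    """Class mass normalization: rescale soft labels so the class masses
    match the prior q for class 1, then pick the larger rescaled score."""
    score1 = q * f_U / f_U.sum()
    score0 = (1.0 - q) * (1.0 - f_U) / (1.0 - f_U).sum()
    return (score1 > score0).astype(int)
```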
OCR Ten Digits ($|L \cup U| = 4000$)
[plot: accuracy (0.5–1.0) vs. labeled set size (0–200) for CMN, 1NN, and the thresholded harmonic function]
20-Newsgroups (PC vs. MAC, $|L \cup U| = 1943$)
[plot: accuracy (0.5–1.0) vs. labeled set size (0–100) for CMN, the thresholded harmonic function, and VP]
Hyperparameter Learning
Learn the graph weights (or hyperparameters):
$w_{ij} = \exp\!\Big(-\sum_{d=1}^{m} \frac{(x_{id} - x_{jd})^2}{\sigma_d^2}\Big)$, with length scales $\sigma_d$;
kNN unweighted graph, $k$; $\varepsilon$NN unweighted graph, $\varepsilon$; etc.
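A sketch of this weighted-graph construction with per-dimension length scales (NumPy; `sigma` is a hypothetical length-$m$ vector of length scales):

```python
import numpy as np

def rbf_weights(X, sigma):
    """w_ij = exp(-sum_d (x_id - x_jd)^2 / sigma_d^2), zero self-weights."""
    Z = X / sigma                                        # scale each dimension
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # scaled squared distances
    W = np.exp(-sq)
    np.fill_diagonal(W, 0.0)                             # no self-edges
    return W
```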
Hyperparameter Learning Minimize entropy on U (maximize label confidence); Evidence maximization with Gaussian process classifiers [tech report CMU-CS-03-175].
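The entropy criterion is straightforward to write down; a sketch of the objective one would minimize over the graph hyperparameters, given the resulting soft labels $f_U$:

```python
import numpy as np

def label_entropy(f_U, eps=1e-12):
    """H = -sum_i [f_i log f_i + (1-f_i) log(1-f_i)]; low H = confident labels."""
    f = np.clip(f_U, eps, 1.0 - eps)   # guard against log(0)
    return float(-(f * np.log(f) + (1.0 - f) * np.log(1.0 - f)).sum())
```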
Hyperparameter Learning
OCR Digits 1 vs. 2, $|L| = 92$, $|U| = 2108$.
         H (bits)   GF accuracy
start    0.6931     94.70 ± 1.19 %
end      0.6542     98.02 ± 0.39 %
An Example Application of Graph-based SSL
Person Identification in Webcam Images: An Application of Semi-Supervised Learning
Maria-Florina Balcan, Avrim Blum, Patrick Pakyan Choi, John Lafferty, Brian Pantano, Mugizi Robert Rwebangira, Xiaojin Zhu. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Abstract (excerpt): An application of semi-supervised learning is made to the problem of person identification in low quality webcam images. Using a set of images of ten people collected over a period of four months, the person identification task is posed as a graph-based semi-supervised learning problem, where only a few training images are labeled. The importance of domain knowledge in graph construction is discussed. The results highlight the importance of domain knowledge in semi-supervised learning, and clearly demonstrate the advantages of using both labeled and unlabeled data over standard supervised learning.
The FreeFoodCam
[Figure 1: four typical FreeFoodCam images.]
Over the months of collection, people changed clothes and hair styles, and one person even grew a beard.
Background Extraction
[Figure 2. Left: mean background image used for background subtraction. Right: breakdown of the 5254 images of the 10 subjects across the seven takes (10/24, 11/13, 1/6, 1/14, 1/20, 1/21, 1/27); per-subject totals range from 330 to 789, and not every subject appears in every take.]
2.1. Data Collection. We asked ten volunteers to appear in seven FreeFoodCam takes over four months. Not all participants could show up for every take. The FreeFoodCam is located in the Computer Science lounge, but we received a live camera feed in our office, and took images from the camera whenever a new frame was available. In each take, the participants took turns entering the scene, walking around, and acting naturally, for example by reading the newspaper or chatting with off-camera colleagues, for five to ten minutes per take. As a result, we collected images where the individuals have varying poses and are at a range of distances from the camera. We discarded all frames that were corrupted by electronic noise in the coke machine, or that contained more than one person.
We processed the foreground area using morphological transforms (Jain, 1989). Further processing was required because the foreground derived from background subtraction often captured only part of the body and contained background areas. We first removed small islands in the foreground by applying the open operation with a 7-pixel-wide square. We then connected vertically-separated pixel blocks (such as head and lower torso) using the close operation with a 60-pixel-by-10-pixel rectangular block. Finally, we made sure the foreground contains the entire person by enlarging the foreground to include neighboring pixels, further closing the foreground with a disk of 20 pixels in radius. Because there is only one person in each image, we discarded all but the largest contiguous block of pixels in the processed foreground. Figure 3 shows some processed foreground images.
Foreground Extraction and Face Detection
[Figure 3: examples of foregrounds extracted by background subtraction and morphological transforms. Figure 4: examples of face images detected by the face detector.]
Only a subset of the individuals is present on any given day (see the breakdown above). Overall, 34% of the images (1808 out of 5254) do not contain a face.
3. The Graphs. Graph-based semi-supervised learning depends critically on the construction of the graph.
A node and its neighbours
[Figure 5: a random image (image 2910) and its five neighbors in the graph: one time edge, three color edges, and one face edge.]
2. Color edges. The color histogram is largely determined by a person's apparel. We assume people change clothes on different days, so the color histogram tends to be unusable across multiple days.
Let $l$ be the number of labeled images, $u$ the number of unlabeled images, and $n = l + u$. The graph is represented by the $n \times n$ weight matrix $W$. Let $D$ be the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, and define the combinatorial Laplacian $\Delta = D - W$.
A walk on the graph
[Figure 7: an example gradient walk on the graph, through color, time, and face edges. The walk starts from an unlabeled image, passes through assorted edges, and ends at a labeled image.]
The harmonic function outperforms the linear kernel SVM baseline (Figure 8). The accuracy can be improved further by incorporating class proportion knowledge with the simple CMN heuristic; the class proportions are estimated from the labeled data with Laplace (add one) smoothing.
With weighted edges, rather than predicting each example to be the unweighted average of its neighbors, the prediction becomes a weighted average. Figure 9(b) shows that by setting weights judiciously (in particular, giving more emphasis to time-edges) one can substantially improve performance, especially on smaller samples. A related problem is to learn the parameters for K-nearest neighbor, i.e. the number of neighbors $k$.
Some results
[Figure 8: unlabeled set accuracy vs. labeled set size (20–200) for (a) the harmonic function and (b) harmonic function + CMN, on two graphs ($t_1 = 2$ sec, $t_2 = 12$ hr, $k_c \in \{1, 3\}$, $k_f = 1$), against the SVM linear kernel baseline. (a) The harmonic function algorithm significantly outperforms the linear kernel SVM, demonstrating that the semi-supervised learning algorithm successfully utilizes the unlabeled data to associate people in images with their identities. (b) The semi-supervised learning algorithm classifies even more accurately by incorporating class proportion knowledge through the CMN heuristic.]
Computation
The basic computation involves solving a sparse linear system of equations:
$f_U = -(\Delta_{UU})^{-1}\,\Delta_{UL}\, y_L$
Some ways of solving this for large systems:
- Conjugate gradients
- Belief propagation
- Convert the original graph into a much smaller backbone graph (Zhu and Lafferty 2005)
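A minimal sketch of the direct and iterative routes using SciPy's sparse routines, written in the equivalent form $\Delta_{UU}\, f_U = W_{UL}\, y_L$ (hypothetical inputs: a sparse symmetric weight matrix `W` in CSR format, index arrays `labeled`/`unlabeled`, and observed labels `y_L`):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve, cg

def harmonic_solution(W, y_L, labeled, unlabeled, use_cg=False):
    """Solve Delta_UU f_U = W_UL y_L directly or by conjugate gradients."""
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = (sp.diags(degrees) - W).tocsr()          # combinatorial Laplacian
    L_UU = L[unlabeled][:, unlabeled]
    b = W.tocsr()[unlabeled][:, labeled] @ y_L   # right-hand side W_UL y_L
    if use_cg:
        f_U, info = cg(L_UU, b)                  # iterative, for large graphs
        assert info == 0, "conjugate gradients did not converge"
        return f_U
    return spsolve(L_UU.tocsc(), b)              # direct sparse solve
```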
Other Approaches to Semi-supervised Learning
Caveat: this is a very big field, and a lot has happened since 2003!
- Nigam et al. (2000): an EM algorithm for SSL applied to text.
- Szummer and Jaakkola (2001): SSL using Markov random walks on graphs.
- Belkin and Niyogi (2002): regularize $f$ by using the top few eigenvectors of the Laplacian.
- Lawrence and Jordan (2005): a Gaussian process approach similar to the TSVM, using a null category noise model.
- Zhou et al. (2004): use the loss function $\sum_i (f_i - y_i)^2$ and the normalised graph Laplacian $D^{-1/2}\,\Delta\, D^{-1/2}$ as a regulariser.
- Transductive SVMs (also called Semi-Supervised Support Vector Machines, S3VM).
Transductive Support Vector Machines
Instead of finding the maximum margin between the labelled points alone, optimize over both the margin and the labels of the unlabelled points.
[figure: margin placement among labelled (+/−) and unlabelled points]
Active Semi-Supervised Learning [Zhu, Lafferty, Ghahramani, 2003]
Semi-supervised learning uses $U$ to help classification. Active learning (pool based) selects queries in $U$ to ask for labels. Putting the two together yields a better query selection criterion than naively selecting the point with maximum label ambiguity.
Active Learning
Select a query to minimize the estimated generalization error, not by maximum ambiguity.
[figure: a lone point a with $f = 0.5$ versus a cluster B with $f = 0.4$; the most ambiguous point is not necessarily the most useful query]
Active Learning
generalization error: $\mathrm{err} = \sum_{i \in U} \sum_{y_i = 0,1} \mathbb{1}[\mathrm{sgn}(f_i) \neq y_i]\, P_{\mathrm{true}}(y_i)$
approximation: $P_{\mathrm{true}}(y_i = 1) \approx f_i$
estimated generalization error: $\widehat{\mathrm{err}} = \sum_{i \in U} \min(f_i,\, 1 - f_i)$
Active Learning
estimated generalization error after querying $x_k$ and receiving label $y_k$:
$\widehat{\mathrm{err}}^{+(x_k, y_k)} = \sum_{i \in U} \min\!\big(f_i^{+(x_k,y_k)},\, 1 - f_i^{+(x_k,y_k)}\big)$
re-training is fast for the harmonic function:
$f_U^{+(x_k,y_k)} = f_U + (y_k - f_k)\, \frac{(\Delta_{UU})^{-1}_{\cdot k}}{(\Delta_{UU})^{-1}_{kk}}$
select query $k^* = \arg\min_k\ (1 - f_k)\, \widehat{\mathrm{err}}^{+(x_k,0)} + f_k\, \widehat{\mathrm{err}}^{+(x_k,1)}$
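The rank-one retraining update makes the expected-error criterion cheap to evaluate; a sketch (NumPy, assuming `G` $= (\Delta_{UU})^{-1}$ has been precomputed and `f_U` is the current harmonic solution):

```python
import numpy as np

def select_query(f_U, G):
    """Pick k in U minimizing (1 - f_k) err^{+(x_k,0)} + f_k err^{+(x_k,1)}."""
    best_k, best_expected = None, np.inf
    for k in range(len(f_U)):
        errs = []
        for y_k in (0.0, 1.0):
            f_plus = f_U + (y_k - f_U[k]) * G[:, k] / G[k, k]  # fast retrain
            errs.append(float(np.minimum(f_plus, 1.0 - f_plus).sum()))
        expected = (1.0 - f_U[k]) * errs[0] + f_U[k] * errs[1]
        if expected < best_expected:
            best_k, best_expected = k, expected
    return best_k
```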
OCR Digits 1 vs. 2 ($|L \cup U| = 2200$)
[plot: accuracy (0.5–1.0) vs. labeled set size (5–20) for Our Method, Most Uncertain Query, Random Query, and SVM]
20 Newsgroups PC vs. MAC ($|L \cup U| = 1943$)
[plot: accuracy (0.5–1.0) vs. labeled set size (5–20) for Our Method, Most Uncertain Query, Random Query, and SVM]
Part II: Some thoughts on Bayesian semi-supervised learning
Moving forward...
We have good methods for transduction, but we don't seem to have a single unified Bayesian framework for inductive SSL. How would we view this problem from a fully Bayesian perspective?
Bayesian Semi-Supervised Learning
$x$ inputs, $y$ labels: $p(x, y) = p(x)\,p(y \mid x) = p(y)\,p(x \mid y)$
Usually we assume some model with parameters:
Discriminative: $p(x, y \mid \theta, \phi) = p(x \mid \theta)\,p(y \mid x, \phi)$. SSL is possible if $\theta$ is somehow related to $\phi$; this works well when $p(y \mid x, \phi)$ is very flexible (e.g. non-parametric, kernel-based).
Generative: $p(x, y \mid \theta, \phi) = p(y \mid \phi)\,p(x \mid y, \theta)$. SSL is possible, but these methods are not currently widely used.
[figure: graphical models for the discriminative and generative factorizations, with parameters $\theta$ and $\phi$]
Bayesian Semi-Supervised Learning
Generative: $p(x, y \mid \theta, \phi) = p(y \mid \phi)\,p(x \mid y, \theta)$
Limitations of the generative approach:
- Often we don't want to model the full $x$. (Solution: maybe we can model some features of $x$?)
- Our models of $p(x \mid y, \theta)$ are usually too inflexible. (Solution: use non-parametric methods?)
Some examples: Kemp et al. (2003), semi-supervised learning with trees; Radford Neal's entry using Dirichlet diffusion trees in the NIPS feature selection competition.
From a Bayesian perspective, semi-supervised learning is just another missing data problem!
Summary
- Semi-supervised learning with harmonic functions
- Active semi-supervised learning using harmonic functions, by minimizing expected generalization error
- Much research in this area, but still some open questions...
References
Balcan, M.-F., Blum, A., Choi, P. P., Lafferty, J., Pantano, B., Rwebangira, M. R., & Zhu, X. (2005). Person identification in webcam images: An application of semi-supervised learning. ICML 2005 Workshop on Learning with Partially Classified Training Data.
Blum, A., & Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. ICML 18.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML 16: 200–209.
Lawrence, N. D., & Jordan, M. I. (2005). Semi-supervised learning via Gaussian processes. NIPS 17.
Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Seeger, M. (2001). Learning with labeled and unlabeled data (Technical Report). University of Edinburgh.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. NIPS 14.
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. NIPS 16.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML 20: 912–919.
Zhu, X., & Lafferty, J. (2005). Harmonic mixtures: combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. ICML 22.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003). Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pp. 58–65.
Zhu, X., & Goldberg, A. B. (2009). Introduction to Semi-Supervised Learning. Morgan & Claypool.