LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014




Agenda
- Semi-supervised learning
- Topics in semi-supervised learning:
  - Label propagation
  - Local and global consistency
  - Graph kernels by spectral transforms
  - Gaussian fields and harmonic functions
- References

Semi-Supervised Learning
Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data.
Why? Labeled data is hard to get: it is expensive, human annotation is time-consuming, and it may require experts. Unlabeled data, by contrast, is cheap.

Why unlabeled data helps [1]: assuming each class is a coherent group (e.g. a Gaussian), the decision boundary shifts when unlabeled data is added (figure: with and without unlabeled data).

Label Propagation [2]
Assumption: closer data points tend to have similar class labels.
General idea: a node's labels propagate to neighboring nodes according to their proximity. The labels on the labeled data are clamped, so the labeled points act like sources that push labels out into the unlabeled data.

Set up
Input x, label y.
Labeled data: (x_1, y_1), (x_2, y_2), ..., (x_l, y_l).
Unlabeled data: x_{l+1}, ..., x_{l+u} (their labels y_{l+1}, ..., y_{l+u} are unobserved), with l << u.
Edge weights: w_ij = exp(−d_ij²/σ²), where d_ij is the Euclidean distance between x_i and x_j.

Probabilistic Transition Matrix
Larger edge weights allow labels to propagate more easily:

    T_ij = P(j → i) = w_ij / Σ_{k=1}^{l+u} w_kj

T_ij is the probability of jumping from node j to node i. The row-normalized matrix T̄ is

    T̄_ij = T_ij / Σ_k T_ik

For example, given

    T = [ 0.3  0.2  0.5 ]
        [ 0.2  0.1  0.7 ]
        [ 0.5  0.7  0.6 ]

the probability that node 2 jumps to node 3 is T_32 = 0.7, and the probability that node 3 jumps to node 1 is T_13 = 0.5.
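As a concrete sketch (the four 1-D points and σ = 1 are assumptions for illustration, not from the slides), the transition matrix and its row-normalized version can be built as:

```python
import numpy as np

# Assumed toy data: 4 points on a line.
X = np.array([0.0, 1.0, 0.1, 0.9])
sigma = 1.0
d2 = (X[:, None] - X[None, :]) ** 2            # squared Euclidean distances
W = np.exp(-d2 / sigma**2)                     # w_ij = exp(-d_ij^2 / sigma^2)

# T_ij = P(j -> i) = w_ij / sum_k w_kj: normalize each column of W.
T = W / W.sum(axis=0, keepdims=True)

# Row-normalized T_bar used by the propagation step: each row sums to 1.
T_bar = T / T.sum(axis=1, keepdims=True)

print(np.allclose(T.sum(axis=0), 1.0))         # columns of T are distributions
print(np.allclose(T_bar.sum(axis=1), 1.0))     # rows of T_bar are distributions
```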

Matrix Y
Define the (l+u) × C label matrix Y, whose ith row represents the label probability distribution of node x_i. For labeled data, Y_ij = 1 if the class of x_i is c_j, and 0 otherwise. The initialization of the rows of Y corresponding to unlabeled data is not important. For example:

    Y = [ 0    1    0   ]
        [ 0.2  0.5  0.3 ]
        [ 0.7  0.2  0.1 ]

Node 1 is labeled with class 2. Row 3 is the label distribution of node 3: for example, 0.7 is the probability that node 3 has label 1.

Algorithm
1. Propagate: Y ← TY. Labels spread along the local structure.
2. Row-normalize Y, to maintain a proper probability distribution over classes.
3. Clamp the labeled data, to keep the originally labeled points. Repeat from step 1 until Y converges.
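The three steps above can be sketched directly (the toy 1-D data, σ = 1, and the choice of two labeled points are assumptions for illustration):

```python
import numpy as np

# Assumed toy data: nodes 0 and 1 are labeled (class 0 and class 1);
# nodes 2 and 3 (at 0.1 and 0.9) are unlabeled.
X = np.array([0.0, 1.0, 0.1, 0.9])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)        # sigma = 1
T = W / W.sum(axis=0, keepdims=True)               # T_ij = w_ij / sum_k w_kj
T_bar = T / T.sum(axis=1, keepdims=True)           # row-normalized

l, C = 2, 2
Y = np.zeros((len(X), C))
Y[0, 0] = Y[1, 1] = 1.0
Y_L = Y[:l].copy()

for _ in range(200):                               # iterate to convergence
    Y = T_bar @ Y                                  # 1. propagate
    Y /= Y.sum(axis=1, keepdims=True)              # 2. row-normalize
    Y[:l] = Y_L                                    # 3. clamp the labeled rows

print(Y[l:].argmax(axis=1))                        # → [0 1]
```

Each unlabeled point ends up with the class of the labeled point it is closest to.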

Convergence
The first two steps combine into Y ← T̄Y, where T̄ is the row-normalized T. Partition T̄ by labeled/unlabeled nodes:

    T̄ = [ T̄_ll  T̄_lu ]
        [ T̄_ul  T̄_uu ]

Since the labeled rows are clamped, only the unlabeled block Y_U matters, and the update becomes Y_U ← T̄_uu Y_U + T̄_ul Y_L, where T̄_uu is the unlabeled-unlabeled submatrix of T̄.

Convergence (cont.)
Consider the row sums of T̄_uu: they are at most 1, and since some transition probability leaks from unlabeled to labeled nodes, the powers T̄_uu^n → 0, so the initial value of Y_U is forgotten. The iteration therefore converges to the fixed point

    Y_U = (I − T̄_uu)^{-1} T̄_ul Y_L

No need to iterate!
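The fixed point can be computed in one linear solve (the toy data and labeling are the same assumed setup as before):

```python
import numpy as np

# Assumed toy setup: nodes 0,1 labeled (classes 0,1), nodes 2,3 unlabeled.
X = np.array([0.0, 1.0, 0.1, 0.9])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
T = W / W.sum(axis=0, keepdims=True)
T_bar = T / T.sum(axis=1, keepdims=True)

l = 2
Y_L = np.eye(2)                                  # one-hot labels for nodes 0,1
T_uu, T_ul = T_bar[l:, l:], T_bar[l:, :l]

# Fixed point Y_U = (I - T_uu)^{-1} T_ul Y_L: no iteration needed.
Y_U = np.linalg.solve(np.eye(2) - T_uu, T_ul @ Y_L)
print(Y_U.argmax(axis=1))                        # → [0 1]
```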

Parameter Setting
How to choose the parameter σ? First, a heuristic: build a minimum spanning tree over all data points under the Euclidean distances d_ij, using Kruskal's algorithm (the classic greedy algorithm). Find the first tree edge that connects two components containing differently labeled points; call its length d_0. Set σ = d_0/3.

The effect of σ

Optimizing σ
The single parameter σ controls the spread of the labels:
- As σ → 0, the classification of each unlabeled point is dominated by its nearest labeled point.
- As σ → ∞, the class probabilities collapse to the class frequencies (no information from label proximity).
We can minimize the entropy of the class labels, H = −Σ_ij Y_ij log Y_ij, which leads to confident classifications. However, the minimum entropy is at σ = 0.

Optimizing σ (cont.)
Add a uniform transition component U, with U_ij = 1/N, to T̄:

    T̃ = εU + (1 − ε)T̄

For small σ the uniform component dominates, so the minimum entropy is no longer at σ = 0. Use σ_1, ..., σ_N to scale each dimension independently, and perform gradient descent with respect to the σ's in order to minimize the entropy:

    dH/dσ_d = Σ_{i=l+1}^{l+u} Σ_{c=1}^{C} (∂H/∂Y_ic)(∂Y_ic/∂σ_d)

Rebalancing Class Proportions
How should we assign classes to unlabeled points? We could simply choose the most likely (ML) class, but the ML method does not explicitly control class proportions. Suppose we want the labels to fit a known or estimated distribution over classes:
- Class mass normalization: scale the columns of Y_U to fit the class distribution, then pick the ML class. This does not guarantee strict label proportions.
- Label bidding: each entry Y_U(i, c) is a bid of sample i for class c. Handle bids from largest to smallest; a bid is taken if class c is not yet full, otherwise it is discarded.
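Class mass normalization can be sketched in a few lines (the soft labels Y_U and the target proportions p are made-up numbers for illustration):

```python
import numpy as np

# Assumed soft labels for 3 unlabeled points over 2 classes.
Y_U = np.array([[0.7, 0.3],
                [0.6, 0.4],
                [0.2, 0.8]])
p = np.array([1 / 3, 2 / 3])         # desired class proportions (assumed known)

mass = Y_U.sum(axis=0)               # current class masses
Y_scaled = Y_U * (p / mass)          # rescale each column to match p
print(Y_scaled.argmax(axis=1))       # → [0 1 1]
```

Note that point 1 flips from class 0 (its plain ML class) to class 1 after rescaling, because class 1 is supposed to hold two thirds of the mass.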

Experiment results: the 3-bands dataset and the spring dataset (figures).

Learning with Local and Global Consistency [4]
The key ideas of semi-supervised learning:
- Nearby points are likely to have the same label (local consistency).
- Points on the same structure (a cluster or manifold) are likely to have the same label (global consistency).

A toy example

Objective: design a classifying function that is sufficiently smooth with respect to the intrinsic structure revealed by the labeled and unlabeled points.

Algorithm
Iterate F(t+1) = αSF(t) + (1 − α)Y: the first term lets each point receive information from its neighbours, and the second term retains the initial label information.

Convergence
The sequence {F(t)} converges. With F(0) = Y,

    F* = (1 − α)(I − αS)^{-1} Y

The proof is similar to that for label propagation.
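The closed form can be sketched with numpy (the toy data and α = 0.9 are assumptions; S = D^{-1/2} W D^{-1/2} as in the regularization framework below):

```python
import numpy as np

# Assumed toy data: nodes 0,1 labeled (classes 0,1); nodes 2,3 unlabeled.
X = np.array([0.0, 1.0, 0.1, 0.9])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
np.fill_diagonal(W, 0.0)                      # no self-loops
Dm12 = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
S = Dm12 @ W @ Dm12                           # normalized affinity matrix

Y = np.zeros((4, 2))
Y[0, 0] = Y[1, 1] = 1.0                       # initial labels
alpha = 0.9                                   # trade-off parameter (assumed)

# F* = (1 - alpha)(I - alpha S)^{-1} Y
F = (1 - alpha) * np.linalg.solve(np.eye(4) - alpha * S, Y)
print(F.argmax(axis=1))                       # → [0 1 0 1]
```

Unlike label propagation, nothing is clamped here: the labeled points keep their classes only because the fitting term outweighs the smoothing.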

Regularization Framework (a different perspective)

    Q(F) = (1/2) ( Σ_{i,j=1}^{n} W_ij ‖ F_i/√D_ii − F_j/√D_jj ‖² + μ Σ_{i=1}^{n} ‖ F_i − Y_i ‖² )

The first term is the smoothness term: it captures the local variations; a good function should not change too much between nearby points. The second term is the fitting constraint (a loss function): a good classifying function should not change too much from the initial label assignment.

Regularization Framework (cont.)
Property of the Laplacian matrix: the smoothness term can be written with the symmetrically normalized Laplacian,

    (1/2) Σ_{i,j=1}^{n} W_ij ‖ F_i/√D_ii − F_j/√D_jj ‖² = tr(Fᵀ L_sym F),   L_sym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2} = I − S

Differentiating Q(F) with respect to F and setting it to zero at F = F*:

    ∂Q/∂F |_{F=F*} = F* − SF* + μ(F* − Y) = 0

which rearranges to

    F* − (1/(1+μ)) SF* − (μ/(1+μ)) Y = 0

Let α = 1/(1+μ) and β = μ/(1+μ). Then (I − αS)F* = βY, and since I − αS is invertible,

    F* = β (I − αS)^{-1} Y

Experiment

Graph Kernels by Spectral Transforms [5]
Graph-based semi-supervised learning methods can be viewed as imposing smoothness conditions on the target function. Eigenvectors with small eigenvalues are smooth and, ideally, represent large cluster structures within the data.

Smoothness
Consider the graph Laplacian L = D − W, where D is the diagonal degree matrix with D_ii = Σ_j W_ij.

Smoothness
Semi-supervised learning seeks a function f that is smooth over the unlabeled points:

    fᵀ L f = (1/2) Σ_{i,j=1}^{N} W_ij (f(i) − f(j))²

Generally, f is smooth if f(i) ≈ f(j) for pairs with a large W_ij. The smoothness of an eigenvector φ_i is φ_iᵀ L φ_i = λ_i, so eigenvectors with smaller eigenvalues are smoother.

Smoothness of Eigenvectors
The complete orthonormal set of eigenvectors φ_1, φ_2, ..., φ_n gives the spectral decomposition

    L = Σ_{i=1}^{n} λ_i φ_i φ_iᵀ
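A small numerical check of both identities (the 5-node path graph with unit weights is an assumption for illustration):

```python
import numpy as np

# Assumed example graph: a 5-node path with unit edge weights.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian L = D - W

lam, phi = np.linalg.eigh(L)                 # eigenvalues in increasing order

# Smoothness of each eigenvector equals its eigenvalue: phi_i^T L phi_i = lambda_i.
print(np.allclose([phi[:, i] @ L @ phi[:, i] for i in range(n)], lam))

# The Laplacian decomposes as L = sum_i lambda_i phi_i phi_i^T.
print(np.allclose((phi * lam) @ phi.T, L))
```

The smoothest eigenvector (λ = 0) is the constant vector, matching the claim that small-eigenvalue eigenvectors vary least over the graph.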

Kernels by Spectral Transforms
Different weightings (i.e. spectral transforms) of the Laplacian eigenvalues lead to different smoothness measures. We want a kernel K that respects smoothness. Define it using the eigenvectors φ_i of the Laplacian and the eigenvalues μ_i of K:

    K = Σ_{i=1}^{N} μ_i φ_i φ_iᵀ

It can also be defined in terms of a spectral transform r of the Laplacian eigenvalues:

    K = Σ_{i=1}^{N} r(λ_i) φ_i φ_iᵀ

Types of Transforms
r(λ) is a non-negative and decreasing transform, for example:
- Regularized Laplacian: r(λ) = 1/(1 + σ²λ)
- Diffusion kernel: r(λ) = exp(−σ²λ/2)
- 1-step random walk: r(λ) = a − λ, with a ≥ 2
- p-step random walk: r(λ) = (a − λ)^p, with a ≥ 2
- Inverse cosine: r(λ) = cos(λπ/4)
- Step function: r(λ) = 1 if λ ≤ λ_cut, else 0
A decreasing r reverses the order of the eigenvalues, so smooth eigenvectors have larger eigenvalues in K. Is there an optimal transform? We still need a way to choose the regularizer here.
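Two of these transforms can be turned into kernels directly (the 5-node path graph and the parameter value σ² = 1 are assumptions for illustration):

```python
import numpy as np

# Assumed example graph: a 5-node path; build K = sum_i r(lambda_i) phi_i phi_i^T.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, phi = np.linalg.eigh(L)

def spectral_kernel(r):
    """Kernel from a spectral transform r applied to the Laplacian eigenvalues."""
    return (phi * r(lam)) @ phi.T

K_diffusion = spectral_kernel(lambda t: np.exp(-0.5 * t))     # diffusion, sigma^2 = 1
K_regularized = spectral_kernel(lambda t: 1.0 / (1.0 + t))    # regularized Laplacian, sigma^2 = 1

# Both transforms are positive and decreasing, so K is positive definite and
# smooth eigenvectors receive the largest eigenvalues of K.
print(np.all(np.linalg.eigvalsh(K_diffusion) > 0))
print(np.all(np.linalg.eigvalsh(K_regularized) > 0))
```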

Kernel Alignment
Assess the fitness of a kernel to the training labels. Empirical kernel alignment compares the kernel matrix K_tr on the training data to a target matrix T on the training data, where T_ij = 1 if y_i = y_j and T_ij = −1 otherwise:

    Â(K_tr, T) = ⟨K_tr, T⟩_F / sqrt( ⟨K_tr, K_tr⟩_F ⟨T, T⟩_F ),   with Frobenius product ⟨M, N⟩_F = Tr(MN)

The alignment measure computes the cosine between K_tr and T. We find the optimal spectral transformation r(λ_i) using this kernel-alignment notion.
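Empirical alignment is a few lines of numpy (the toy labels and the two candidate kernels are assumptions for illustration):

```python
import numpy as np

# Assumed training labels for 4 points.
y = np.array([0, 0, 1, 1])
T = np.where(y[:, None] == y[None, :], 1.0, -1.0)   # T_ij = +1 if y_i = y_j else -1

def alignment(K, T):
    """Cosine between K and T under the Frobenius product <M,N>_F."""
    return np.sum(K * T) / (np.linalg.norm(K) * np.linalg.norm(T))

K_perfect = T.copy()             # a kernel matching the label structure exactly
K_identity = np.eye(4)           # an uninformative kernel

print(alignment(K_perfect, T))   # → 1.0
print(alignment(K_identity, T))  # → 0.5
```

The kernel that mirrors the label structure reaches the maximal alignment of 1, while the identity kernel aligns only through the diagonal.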

Convex Optimization
Convex sets, convex functions, convex optimization problems.
Linear Programming (LP):
    minimize   cᵀx + d
    subject to Gx ≤ h, Ax = b
Quadratically Constrained Quadratic Programming (QCQP):
    minimize   (1/2) xᵀPx + cᵀx + d
    subject to (1/2) xᵀQ_i x + r_iᵀx + s_i ≤ 0, Ax = b

QCQP
The kernel alignment between K_tr and T is a convex function of the kernel eigenvalues μ_i, with no assumption on the parametric form of the transform r(λ_i). We need K to be positive semi-definite, so we restrict the eigenvalues of K to be ≥ 0. This leads to a computationally efficient Quadratically Constrained Quadratic Program: minimize a convex quadratic function over a smaller feasible region, where both the objective function and the constraints are quadratic. Its complexity is comparable to linear programs.

Impose Order Constraints
We would like the spectral transformation to be decreasing: smooth functions are preferred, so smoother eigenvectors should receive the bigger eigenvalues. An order-constrained semi-supervised kernel K is the solution to the resulting convex optimization problem with the constraints μ_i ≥ μ_{i+1}.

Improved Order Constraints
The constant eigenvector acts as a bias term in the graph kernel: λ_1 = 0, and the corresponding eigenvector φ_1 is constant. Bias terms need not be constrained. The improved order constraints therefore ignore the constant eigenvectors:

    μ_i ≥ μ_{i+1},  i = 1, ..., n − 1, for φ_i not constant

Harmonic Functions (Zhu, 2003) [3]
Now define the class labeling f in terms of a Gaussian over a continuous space, instead of a random field over a discrete label set. The distribution on f is a Gaussian field:

    p(f) = e^{−E(f)} / Z,   Z = ∫_{f|_L = f_l} exp(−E(f)) df

This is useful for multi-label problems (where inference in discrete random fields is NP-hard). The ML configuration is now unique, attainable by matrix methods, and characterized by harmonic functions.

Harmonic Energy
The energy of a labeling f is defined as:

    E(f) = (1/2) Σ_{i,j} w_ij (f(i) − f(j))²

so nearby points should have similar labels. The solution that minimizes E(f), subject to f = f_l on the labeled points, is harmonic: Δf = 0 at the unlabeled points, where Δ = D − W is the combinatorial Laplacian. Equivalently, the value of f at an unlabeled point is the weighted average of f at the neighboring points:

    f(j) = (1/d_j) Σ_{i~j} w_ij f(i),   for j = l+1, ..., l+u

i.e. f = D^{-1} W f on the unlabeled points.

Harmonic Solution
As before, split f into labeled and unlabeled parts, f = (f_l; f_u), and partition W and D accordingly. Solving Δf = 0 on the unlabeled points with f_L = f_l gives

    f_u = (D_uu − W_uu)^{-1} W_ul f_l = (I − P_uu)^{-1} P_ul f_l,   where P = D^{-1} W

This can be viewed as heat-kernel classification, but independent of the time parameter.
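The harmonic solution is a single linear solve (the toy 1-D points and the labeling f_l = [0, 1] are assumptions for illustration):

```python
import numpy as np

# Assumed toy data: nodes 0,1 labeled with f_l = [0, 1]; nodes 2,3 unlabeled.
X = np.array([0.0, 1.0, 0.1, 0.9])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))

l = 2
f_l = np.array([0.0, 1.0])

# f_u = (D_uu - W_uu)^{-1} W_ul f_l, with Delta = D - W the combinatorial Laplacian.
Delta = D - W
f_u = np.linalg.solve(Delta[l:, l:], W[l:, :l] @ f_l)

print(np.round(f_u, 2))        # → [0.39 0.61]
```

Thresholding at 0.5 sends node 2 (near the 0-labeled point) to class 0 and node 3 to class 1, and each value is indeed a weighted average of its neighbours.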

Summary
- Label propagation: propagate labels and clamp the labeled data.
- Local and global consistency: allow f(x_l) to differ from Y_l, but penalize the difference, introducing a balance between fitting the labeled data and the graph energy.
- Graph kernels by spectral transforms: measure smoothness using the eigenvectors of the Laplacian; choose the transform by kernel alignment.
- Gaussian fields and harmonic functions: relax the discrete labels to continuous values (a Gaussian field).

References
[1] Zhu. Semi-supervised learning tutorial. http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
[2] Zhu, Ghahramani. Learning from labeled and unlabeled data.
[3] Zhu, Ghahramani, Lafferty. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions.
[4] Zhou et al. Learning with Local and Global Consistency.
[5] Zhu et al. Semi-supervised learning.
[6] Matt Stokes. Semi-Supervised Learning.