Let there be $l$ labeled points $(x_1, y_1), \ldots, (x_l, y_l)$ and $u$ unlabeled points $x_{l+1}, \ldots, x_{l+u}$. Let $n = l + u$. Assume $y \in \{0, 1\}$. Assume we are given an $n \times n$ symmetric weight matrix $W$ between all points. Consider the graph $G$ with $n$ nodes and weights $W$ on the edges. Nodes $1, \ldots, l$ are labeled as $y_1, \ldots, y_l$, and the task is to assign labels to nodes $l+1, \ldots, n$.

1 Label propagation

Our strategy is to first compute a real function $f$ on $G$ with certain nice properties, and then assign labels based on $f$. Intuitively we want close points to have similar labels. One way is to let the labels $y_1, \ldots, y_l$ propagate through the graph structure $G$. In the continuous case, $G$ becomes the data manifold and propagation becomes diffusion. We do require, however, that labeled data remain fixed during the process. Let $D$ be the diagonal matrix s.t. $D_{ii} = \sum_{j=1}^n w_{ij}$. $D_{ii}$ is known as the volume of node $i$. Define a probabilistic transition matrix $P = D^{-1} W$. We have the following algorithm.

Algorithm (Label propagation). Repeat until convergence:
1. Clamp $f(i) = y_i$, $i = 1, \ldots, l$.
2. Propagate $f = Pf$.

We showed in [1] that the label propagation algorithm converges to a simple solution regardless of the initial $f$. We split the $W$ matrix into 4 blocks after the $l$-th row / column (and $D$, $P$ similarly):

$$ W = \begin{bmatrix} W_{ll} & W_{lu} \\ W_{ul} & W_{uu} \end{bmatrix} \qquad (1) $$

Let $f = \begin{bmatrix} f_l \\ f_u \end{bmatrix}$, where $f_l$ are clamped to the labeled data and $f_u$ are on the $u$ unlabeled points. Then the solution is

$$ f_u = (I - P_{uu})^{-1} P_{ul} f_l \qquad (2) $$
$$ \phantom{f_u} = (I - D_{uu}^{-1} W_{uu})^{-1} D_{uu}^{-1} W_{ul} f_l \qquad (3) $$
$$ \phantom{f_u} = (D_{uu} - W_{uu})^{-1} W_{ul} f_l \qquad (4) $$

The algorithm is closely related to the random walk work by Szummer and Jaakkola [2], with two major differences: 1. we fix the labeled points; 2. our solution is an equilibrium state, while in [2] it is dynamic w.r.t. a time parameter $t$. We will come back to this point when discussing heat kernels later. The transition matrix was slightly different in [1], but it does not affect the analysis.
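For concreteness, here is a minimal NumPy sketch (not part of the original note) of both the iterative clamped propagation and the closed-form solution (4). The function name, the convention that the labeled points occupy the first $l$ indices, and the fixed iteration count are our own choices; the sketch also assumes every node has positive degree.

```python
import numpy as np

def label_propagation(W, y_l, n_iter=1000):
    """Iterative label propagation with clamping, plus the closed form
    f_u = (D_uu - W_uu)^{-1} W_ul f_l.  Labeled points come first."""
    W = np.asarray(W, dtype=float)
    y_l = np.asarray(y_l, dtype=float)
    n, l = W.shape[0], len(y_l)
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic P = D^{-1} W

    # Iterative version: propagate f = P f, then clamp the labeled entries.
    f = np.zeros(n)
    f[:l] = y_l
    for _ in range(n_iter):
        f = P @ f
        f[:l] = y_l                           # keep labeled data fixed

    # Closed-form version, equation (4).
    D_uu = np.diag(W[l:].sum(axis=1))
    f_u = np.linalg.solve(D_uu - W[l:, l:], W[l:, :l] @ y_l)
    return f[l:], f_u

# Toy usage: a 4-node graph, nodes 0 (label 1) and 1 (label 0) labeled.
W = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]
f_iter, f_closed = label_propagation(W, y_l=[1, 0])   # both ~ [2/3, 1/3]
```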

2 Random walk and electric networks

A solution of this form is not new. It is not hard to see that for any unlabeled node $i$, $f(i)$ is the weighted average of all its neighboring $f$ values. Such an $f$ is called the harmonic function over $G$ with Dirichlet boundary condition $f_l$, and it exists uniquely. Doyle and Snell discussed it in [3]. Using it for semi-supervised learning, as far as we know, has not been proposed before.

There is a random walk interpretation of the solution. Imagine a particle walking on $G$. If it is on an unlabeled node $i$, it will walk to a node $j$ with probability $P_{ij}$ after one step. Labeled nodes are absorbing: once the particle reaches a labeled node, it stays there forever. Then $f(i)$ is the probability that the particle, starting from node $i$, reaches an absorbing state with label 1 before reaching an absorbing state with label 0.

An electric network interpretation is also given in [3]. Imagine the edges of $G$ being resistors with conductance $W$. We connect the $y = 1$ labeled nodes to a $+1$ voltage source, and the $y = 0$ labeled nodes to the ground. Then $f_u$ is the voltage on the unlabeled nodes. Furthermore, $f_u$ minimizes the energy dissipation of the electric network $G$, for the given $f_l$:

$$ \frac{1}{2} \sum_{i,j} w_{ij} \left( f(i) - f(j) \right)^2 \qquad (5) $$
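The random walk interpretation above can be checked numerically. The following Monte Carlo sketch (ours, not from the note) estimates, for one unlabeled start node, the probability of being absorbed at a label-1 node before a label-0 node; with enough walks the estimate should approach the harmonic value $f(\mathrm{start})$. It again assumes the labeled points come first and that every unlabeled node can reach some labeled node.

```python
import numpy as np

def absorption_probability(W, y_l, start, n_walks=5000, seed=0):
    """Estimate f(start) as the fraction of random walks from `start`
    that hit a label-1 node before a label-0 node."""
    W = np.asarray(W, dtype=float)
    y_l = np.asarray(y_l)
    n, l = W.shape[0], len(y_l)
    P = W / W.sum(axis=1, keepdims=True)      # transition matrix D^{-1} W
    rng = np.random.default_rng(seed)

    hits = 0
    for _ in range(n_walks):
        i = start
        while i >= l:                         # labeled nodes are absorbing
            i = rng.choice(n, p=P[i])
        hits += int(y_l[i])                   # 1 if absorbed with label 1
    return hits / n_walks
```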

= L e tl j 0 (9) = L (0) = (D U W UU ) () Now for an unlabeled point x, John suggested that its label can be computed by looking at all unlabeled neighbors z, and z's labeled neighbors y. That is, f x / z2u / z2u y2l y2l K(x; z)p (z! y)f y (2) K(x; z) W zy D Uzz f y (3) 4 Graph Laplacian and spectral clustering Remember the solution f minimizes the energy (5) given the constraint f l. The well studied spectral graph segmentation problem minimizes the same energy function without the constraint. Belkin and Niyogi [5] showed that the energy function i;j 2 (f(i) f(j))2 w ij = f 0 Lf (4) where L = D W is the graph Laplacian of G. When f is not constrained (except normalized), the energy is minimized by the minimum eigenvalue solution to the generalized eigenvalue problem Lf = Df. It can be shown that f = is a trivial solution with minimum eigenvalue = 0. This corresponds to an electric network with the same voltage on all nodes, thus no energy dissipation. The next interesting solution is the eigenvector corresponding to the second smallest (non zero) eigenvalue. This is exact the normalized cut algorithm by Shi and Malik [6]. In addition when W is close to block diagonal, it can be shown that the image of data points are tightly clustered in the eigenspace spanned by the first few eigenvectors of L [7] [8]. This leads to spectral clustering algorithms. In semi-supervised learning, we have labeled data f l. A naive algorithm is to cluster the whole data set first with spectral clustering, then assign labels to clusters. This works well when W is block diagonal, but we believe the constrained solution f to be more robust when W is less well formed. 3

A related algorithm is proposed by Belkin and Niyogi in [9]. Note that the normalized eigenvectors of $L$ form a basis on $G$. The energy (or "smoothness" as it is called in [9]) of any normalized eigenvector is its eigenvalue. Belkin et al. propose to learn a function (called $g$ here) on $G$ that is regularized to have small energy and aims to fit the labeled data. In particular, they pick the $p$ normalized eigenvectors of $L$ with the smallest eigenvalues, and $g$ is the linear combination of the $p$ eigenvectors which best fits $f_l$ in the least squares sense. We remark that $f$ by definition fits the labeled data exactly, while $g$ may not; $f$ is the minimum energy solution, while $g$, although made up of a low-energy eigenvector basis, may be far from minimum energy.

5 From f function to classification

The seemingly obvious classification method is to assign node $i$ label 1 if $f(i) > 0.5$, and label 0 otherwise. This works well when the classes are well separated. However, in real datasets classes are often not ideally separated, and this 0.5-thresholding tends to produce severely unbalanced classes. We experiment with three ways to control class balance:

- Constraining the mean of $f$
- Adding two virtual nodes to $G$
- Heuristic class mass normalization
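The note does not spell these heuristics out here. As one possible reading of "class mass normalization" (our illustration, not necessarily the exact procedure used in the experiments), the sketch below rescales the class-1 masses $f_u$ and the class-0 masses $1 - f_u$ to desired proportions $q$ and $1 - q$ before comparing them; $q$ is a parameter we introduce for the assumed prior proportion of class 1.

```python
import numpy as np

def classify_cmn(f_u, q=0.5):
    """Hypothetical class mass normalization: rescale class masses to the
    desired proportions q and 1 - q, then take the larger one per node.
    Assumes the entries of f_u lie strictly between 0 and 1."""
    f_u = np.asarray(f_u, dtype=float)
    mass1 = f_u / f_u.sum()                   # normalized class-1 mass per node
    mass0 = (1 - f_u) / (1 - f_u).sum()       # normalized class-0 mass per node
    return (q * mass1 > (1 - q) * mass0).astype(int)
```

When the mean of $f_u$ is 0.5 and $q = 0.5$, this coincides with 0.5-thresholding; otherwise it shifts the effective threshold so that the predicted classes are rebalanced toward the proportions $q$ and $1 - q$.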

References

[1] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

[2] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. In NIPS, 2001.

[3] P. G. Doyle and J. L. Snell. Random Walks and Electric Networks. Mathematical Association of America, 1984.

[4] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proc. 19th International Conference on Machine Learning, 2002.

[5] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Technical Report TR-2002-01, University of Chicago, 2002.

[6] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

[7] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.

[8] M. Meila and J. Shi. A random walks view of spectral segmentation. In AISTATS, 2001.

[9] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on manifolds. Technical Report TR-2002-12, University of Chicago, 2002.