Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics


Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics
Part I: Factorizations and Statistical Modeling/Inference
Amnon Shashua
School of Computer Science & Engineering, The Hebrew University
ICML07 Tutorial

Factorizations of Multi-Dimensional Arrays
Normally: factorize the data into a lower-dimensional space in order to describe the original data in a concise manner.
Focus of the lecture:
- Factorization of the empirical joint distribution: Latent Class Model
- Factorization of symmetric forms: probabilistic clustering
- Factorization of partially-symmetric forms: latent clustering

N-way array decompositions. A rank-1 matrix G is represented by an outer product of two vectors: G = u v^T, i.e., G_ij = u_i v_j. A rank-1 tensor (n-way array) is represented by an outer product of n vectors: G = u^(1) ⊗ u^(2) ⊗ ... ⊗ u^(n), i.e., G_{i1...in} = u^(1)_{i1} u^(2)_{i2} ... u^(n)_{in}.

N-way array decompositions. A matrix G is of (at most) rank k if it can be represented by a sum of k rank-1 matrices: G = Σ_{r=1..k} u_r v_r^T. A tensor G is of (at most) rank k if it can be represented by a sum of k rank-1 tensors: G = Σ_{r=1..k} u_r^(1) ⊗ u_r^(2) ⊗ ... ⊗ u_r^(n).

N-way array symmetric decompositions. A symmetric rank-1 matrix G: G = u u^T. A symmetric rank-k matrix G: G = Σ_{r=1..k} u_r u_r^T. A super-symmetric rank-1 tensor (n-way array) is represented by an outer product of n copies of a single vector: G = u ⊗ u ⊗ ... ⊗ u (n times). A super-symmetric tensor described as a sum of k super-symmetric rank-1 tensors, G = Σ_{r=1..k} u_r ⊗ ... ⊗ u_r, is of (at most) rank k.
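To make the notation concrete, here is a small numpy sketch (my own illustration, not part of the tutorial) of rank-1 and rank-k constructions for matrices, general 3-way tensors, and super-symmetric tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# rank-1 matrix: outer product of two vectors
u, v = rng.random(4), rng.random(5)
G1 = np.outer(u, v)                                        # shape (4, 5), rank 1

# rank-k matrix: sum of k rank-1 matrices
k = 3
U, V = rng.random((4, k)), rng.random((5, k))
Gk = sum(np.outer(U[:, r], V[:, r]) for r in range(k))     # rank <= k

# rank-1 3-way tensor: outer product of three vectors
a, b, c = rng.random(4), rng.random(5), rng.random(6)
T1 = np.einsum('i,j,l->ijl', a, b, c)                      # shape (4, 5, 6), tensor rank 1

# super-symmetric rank-k 3-way tensor: sum of triple outer products of single vectors
W = rng.random((4, k))
Tk = sum(np.einsum('i,j,l->ijl', W[:, r], W[:, r], W[:, r]) for r in range(k))

print(np.linalg.matrix_rank(Gk))                           # 3 (generically)
```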

General Tensors: Latent Class Models

Reduced Rank in Statistics. Let X and Y be two random variables taking values in the sets {1,...,n} and {1,...,m}. The statement "X is independent of Y", denoted X ⊥ Y, means: P(X = i, Y = j) = P(X = i) P(Y = j) for all i, j. Here P(X, Y) is a 2D array (a matrix), P(X) is a 1D array (a vector), and P(Y) is a 1D array (a vector); X ⊥ Y means that P(X, Y) = P(X) P(Y)^T is a rank-1 matrix.

Reduced Rank in Statistics. Let X_1,...,X_n be random variables taking values in the sets {1,...,m_1},...,{1,...,m_n}. The statement that X_1,...,X_n are mutually independent means: P(X_1 = i_1,...,X_n = i_n) = P(X_1 = i_1) ... P(X_n = i_n). Here P(X_1,...,X_n) is an n-way array (a tensor) and each P(X_j) is a 1D array (a vector) whose entries are the marginal probabilities; mutual independence means that P(X_1,...,X_n) = P(X_1) ⊗ ... ⊗ P(X_n) is a rank-1 tensor.

Reduced Rank in Statistics. Let X, Y, Z be three random variables taking values in the sets {1,...,n}, {1,...,m}, {1,...,k}. The conditional independence statement X ⊥ Y | Z means: P(X = i, Y = j | Z = z) = P(X = i | Z = z) P(Y = j | Z = z); equivalently, each slice [P(X = i, Y = j | Z = z)]_{ij} is a rank-1 matrix.

Reduced Rank in Statistics: Latent Class Model. Let X_1,...,X_n be random variables taking values in the sets {1,...,m_1},...,{1,...,m_n}, and let Y be a hidden random variable taking values in the set {1,...,k}. The observed joint probability n-way array is: P(X_1,...,X_n) = Σ_{y=1..k} P(Y = y) P(X_1 | Y = y) ⊗ ... ⊗ P(X_n | Y = y). A statement of the form "X_1,...,X_n are independent given Y" about the n-way array translates to the algebraic statement of the array having tensor rank equal to (at most) k.

Reduced Rank in Statistics: Latent Class Model. P(X_1,...,X_n | Y = y) is an n-way array, each P(X_j | Y = y) is a 1D array (a vector), P(X_1 | Y = y) ⊗ ... ⊗ P(X_n | Y = y) is a rank-1 n-way array, and P(Y) is a 1D array (a vector).

Reduced Rank in Statistics: Latent Class Model. For n = 2: P(X_1 = i, X_2 = j) = Σ_{y=1..k} P(Y = y) P(X_1 = i | Y = y) P(X_2 = j | Y = y), i.e., the joint matrix factors as A diag(p) B^T, where the columns of A and B hold the conditional distributions and p is the prior over Y.
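A small numpy sketch of the n = 2 case (the names A, B, p are mine) showing that the latent class model is exactly a non-negative low-rank matrix factorization:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n1, n2 = 3, 4, 5

p = rng.random(k); p /= p.sum()                  # prior P(Y = y)
A = rng.random((n1, k)); A /= A.sum(axis=0)      # column y holds P(X1 | Y = y)
B = rng.random((n2, k)); B /= B.sum(axis=0)      # column y holds P(X2 | Y = y)

P = A @ np.diag(p) @ B.T                         # joint distribution P(X1, X2)
print(np.isclose(P.sum(), 1.0))                  # a valid distribution
print(np.linalg.matrix_rank(P))                  # rank <= k (3, generically)
```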

Reduced Rank in Statistics: Latent Class Model. Given an empirical joint distribution G, the model is fit by finding a prior p and conditional factors that minimize loss(G, Σ_{y=1..k} p_y P(X_1 | y) ⊗ ... ⊗ P(X_n | y)) over the non-negative factors.

Reduced Rank in Statistics: Latent Class Model. When loss(·) is the relative entropy, the factorization above is called pLSA (Hofmann 1999). Typical algorithm: Expectation-Maximization (EM).

Reduced Rank in Statistics: Latent Class Model. The Expectation-Maximization algorithm introduces auxiliary tensors and alternates among the three sets of variables (the prior over the latent class and the conditional factors); each update is written with Hadamard (entrywise) products and reduces to repeated application of a projection onto the probability simplex.
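The following is a minimal EM sketch for the n = 2 case (pLSA-style updates), assuming the empirical joint G is a normalized co-occurrence table; it illustrates the alternating scheme rather than reproducing the slides' exact tensor updates:

```python
import numpy as np

def plsa_em(G, k, iters=200, seed=0):
    """EM for the n = 2 latent class model: G ~ sum_y p[y] * outer(A[:, y], B[:, y])."""
    rng = np.random.default_rng(seed)
    n1, n2 = G.shape
    p = np.full(k, 1.0 / k)                          # P(y)
    A = rng.random((n1, k)); A /= A.sum(axis=0)      # P(x1 | y)
    B = rng.random((n2, k)); B /= B.sum(axis=0)      # P(x2 | y)
    for _ in range(iters):
        # E-step: responsibilities q[i, j, y] = P(y | x1 = i, x2 = j)
        q = np.einsum('y,iy,jy->ijy', p, A, B)
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate prior and conditionals from expected counts
        Nq = G[:, :, None] * q
        p = Nq.sum(axis=(0, 1)); p /= p.sum()
        A = Nq.sum(axis=1); A /= A.sum(axis=0) + 1e-12
        B = Nq.sum(axis=0); B /= B.sum(axis=0) + 1e-12
    return p, A, B

G = np.random.default_rng(2).random((6, 7)); G /= G.sum()
p, A, B = plsa_em(G, k=3)
print(np.abs(G - A @ np.diag(p) @ B.T).sum())        # total absolute reconstruction error
```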

Reduced Rank in Statistics: Latent Class Model. An L2 version can be realized by replacing the relative entropy with the squared Euclidean distance in the projection operation.

Reduced Rank in Statistics: Latent Class Model. Let G be the empirical joint distribution; we wish to find a prior p and non-negative rank-1 factors which minimize: || G − Σ_{r=1..k} p_r u_r^(1) ⊗ ... ⊗ u_r^(n) ||^2.

Reduced Rank in Statistics: Latent Class Model. To find one set of factors given all the others, we need to solve a convex program (a non-negative least-squares problem).

Reduced Rank in Statistics: Latent Class Model. An equivalent representation: Non-negative Matrix Factorization (with the columns of the factors normalized to recover the probabilistic form).
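For the L2 variant, the standard multiplicative NMF updates (Lee & Seung) give one concrete realization; this is a hedged sketch of that connection, not the tutorial's own algorithm:

```python
import numpy as np

def nmf(G, k, iters=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF: G ~ W @ H with W, H >= 0 (squared-error loss)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        H *= (W.T @ G) / (W.T @ W @ H + eps)
        W *= (G @ H.T) / (W @ H @ H.T + eps)
    return W, H

G = np.random.default_rng(3).random((20, 30))
W, H = nmf(G, k=5)
print(np.linalg.norm(G - W @ H) / np.linalg.norm(G))   # relative reconstruction error
```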

(non-negative) Low Rank Decompositions. The rank-1 blocks tend to represent local parts of the image class, with a hierarchical buildup of important parts as the number of factors grows (shown for 4, 8, 12, 16, and 20 factors).

(non-negative) Low Rank Decompositions. Example: the swimmer data set (a sample set of 256 images), comparing Non-negative Tensor Factorization with NMF.

Super-symmetric Decompositions: Clustering over Hypergraphs (Matrices)

Clustering data into k groups: Pairwise Affinity. Input: points x_1,...,x_n and (pairwise) affinity values K_ij, interpreted as the probability that x_i and x_j are clustered together; the class labels are unknown.

Clustering data into k groups: Pairwise Affinity. In the illustrated example with k = 3 clusters, one point does not belong to any cluster in a hard sense. A probabilistic view: G_ij is the probability that x_i belongs to the j-th cluster (note that each row of G sums to 1). What is the (algebraic) relationship between the input matrix K and the desired G?

Clustering data into k groups: Pairwise Affinity. Assume the following conditional independence statements: the memberships of x_i and x_j are independent given the clustering, so that K_ij = Σ_{r=1..k} G_ir G_jr, i.e., K = G G^T.
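The relation K = G G^T is easy to check numerically; a tiny illustration (the soft-assignment matrix G here is random):

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.random((8, 3))
G /= G.sum(axis=1, keepdims=True)        # rows are membership probabilities
K = G @ G.T                              # K[i, j] = sum_r G[i, r] * G[j, r]
print(np.allclose(K, K.T), np.linalg.matrix_rank(K))   # symmetric, rank <= 3
```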

Probabilistic Min-Cut. A hard assignment requires that G has 0/1 entries with exactly one nonzero per row, so that G^T G = diag(n_1,...,n_k), where n_1,...,n_k are the cardinalities of the clusters. Proposition: over this feasible set of matrices G, fitting K ≈ G G^T is equivalent to the min-cut formulation.

Relation to Spectral Clustering. Add a balancing constraint: all clusters have the same size n/k. Put together with K = G G^T, this means that K 1 = (n/k) 1 as well, and the constant n/k can be dropped. Relax the balancing constraint by replacing K with the closest doubly-stochastic matrix K', and ignore the non-negativity constraint: maximize tr(G^T K' G) subject to G^T G = I.

Relation to Spectral Clustering. Let D = diag(K 1). The closest doubly-stochastic matrix in the L1 error norm corresponds to the Ratio-Cuts normalization, while Normalized-Cuts uses K ← D^{-1/2} K D^{-1/2}. Proposition: iterating K ← D^{-1/2} K D^{-1/2} converges to the closest doubly-stochastic matrix in the KL-divergence error measure.
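A sketch of the iterated scaling from the proposition; kl_doubly_stochastic is a name I chose, and the stopping rule is a fixed iteration count for simplicity:

```python
import numpy as np

def kl_doubly_stochastic(K, iters=1000):
    """Iterate K <- D^{-1/2} K D^{-1/2} with D = diag(K 1) for symmetric non-negative K."""
    K = K.copy()
    for _ in range(iters):
        s = 1.0 / np.sqrt(K.sum(axis=1))
        K = s[:, None] * K * s[None, :]
    return K

K = np.abs(np.random.default_rng(5).random((6, 6)))
K = 0.5 * (K + K.T)                      # symmetric non-negative affinity matrix
F = kl_doubly_stochastic(K)
print(F.sum(axis=0), F.sum(axis=1))      # row and column sums approach 1
```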

New Normalization for Spectral Clustering: K' = argmin_F || K − F ||_F^2 subject to F ≥ 0, F 1 = 1, F = F^T, where we are looking for the closest doubly-stochastic matrix in the least-squares sense.

New Normalization for Spectral Clustering. To find K' as above, we break the problem into two subproblems — projection onto the affine set {F : F 1 = 1, F = F^T} and projection onto the non-negative cone {F : F ≥ 0} — and use the von Neumann successive projection lemma, alternating the two projections until convergence.
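A hedged sketch of this successive-projection scheme; the closed-form affine projection below is my own derivation for symmetric matrices, and plain alternation follows the slide's von Neumann recipe:

```python
import numpy as np

def project_affine(X):
    """Euclidean projection of a symmetric X onto {F : F 1 = 1, F^T 1 = 1}."""
    n = X.shape[0]
    r = X.sum(axis=1)                    # row sums (= column sums for symmetric X)
    s = X.sum()
    mu = (1.0 - r - (n - s) / (2.0 * n)) / n
    return X + mu[:, None] + mu[None, :]

def ls_doubly_stochastic(K, iters=200):
    F = K.copy()
    for _ in range(iters):
        F = project_affine(F)            # enforce unit row/column sums
        F = np.maximum(F, 0.0)           # enforce non-negativity
    return F

K = np.abs(np.random.default_rng(6).random((6, 6)))
K = 0.5 * (K + K.T)
F = ls_doubly_stochastic(K)
print(F.min() >= 0, F.sum(axis=1))       # non-negative, row sums close to 1
```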

New Normalization for Spectral Clustering. Running time of the successive-projection scheme vs. a QP solver, and a comparison of the three normalizations.

New Normalization for Spectral Clustering. Clustering results on the UCI data sets and on cancer data sets.

Super-symmetric Decompositions: Clustering over Hypergraphs (Tensors)

Clustering data into k groups: Beyond Pairwise Affinity. A model-selection problem whose models are determined by n−1 points can be described by a factorization problem of an n-way array (tensor). Example: clustering m points into k lines. Input: K_ijl, the probability that points x_i, x_j, x_l belong to the same model (line). Output: G_ir, the probability that the point x_i belongs to the r-th model (line). Under the independence assumption, K_ijl = Σ_r G_ir G_jr G_lr, so K is a 3-dimensional super-symmetric tensor of rank k (rank 4 in the slide's example).
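One simple way to realize such a triplet affinity for the line example is to score each 3-tuple by how close it is to collinear; the triangle-area measure and Gaussian kernel below are my choices for illustration, not necessarily the tutorial's:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
t = rng.random(10)
noise = 0.01 * rng.standard_normal(10)
pts = np.vstack([np.c_[t[:5], 2 * t[:5] + noise[:5]],        # 5 points near y = 2x
                 np.c_[t[5:], 1 - t[5:] + noise[5:]]])       # 5 points near y = 1 - x

m, sigma = len(pts), 0.05
K = np.zeros((m, m, m))
for i, j, l in combinations(range(m), 3):
    # triangle area: zero iff the three points are collinear
    area = 0.5 * abs((pts[j, 0] - pts[i, 0]) * (pts[l, 1] - pts[i, 1])
                     - (pts[j, 1] - pts[i, 1]) * (pts[l, 0] - pts[i, 0]))
    val = np.exp(-(area / sigma) ** 2)
    for perm in [(i, j, l), (i, l, j), (j, i, l), (j, l, i), (l, i, j), (l, j, i)]:
        K[perm] = val                                        # super-symmetric by construction

within = [K[i, j, l] for i, j, l in combinations(range(5), 3)]
mixed = [K[i, j, l] for i, j in combinations(range(5), 2) for l in range(5, 10)]
print(np.mean(within), np.mean(mixed))   # within-line triples score much higher on average
```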

Clustering data into k groups: Beyond Pairwise Affinity. General setting: clusters are defined by (n−1)-dimensional subspaces; for each n-tuple of points we define an affinity value that decays with the volume spanned by the n-tuple. Input: K_{i1...in}, the probability that x_{i1},...,x_{in} belong to the same cluster. Output: G_ir, the probability that the point x_i belongs to the r-th cluster. Assuming conditional independence, K_{i1...in} = Σ_r G_{i1 r} ... G_{in r}, so K is an n-dimensional super-symmetric tensor of rank k.

Clustering data into k groups: Beyond Pairwise Affinity. Hyper-stochastic constraint: under the balancing requirement, K is (scaled) hyper-stochastic, i.e., all of its marginal sums are equal. Theorem: for any non-negative super-symmetric tensor, iterating the marginal normalization converges to a hyper-stochastic tensor.
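A sketch of one natural iterated normalization for a 3-way super-symmetric tensor, scaling by the marginal sums raised to the power −1/3 (the tensor analogue of D^{-1/2} K D^{-1/2}); the slide's exact update is not reproduced, so treat this as an assumption:

```python
import numpy as np

def hyperstochastic_3way(K, iters=300):
    K = K.copy()
    for _ in range(iters):
        d = K.sum(axis=(1, 2))           # marginal sums (equal across modes here,
                                         # because K is super-symmetric)
        s = d ** (-1.0 / 3.0)
        K = K * s[:, None, None] * s[None, :, None] * s[None, None, :]
    return K

rng = np.random.default_rng(8)
A = rng.random((6, 6, 6))
K = sum(A.transpose(p) for p in [(0, 1, 2), (0, 2, 1), (1, 0, 2),
                                 (1, 2, 0), (2, 0, 1), (2, 1, 0)])   # super-symmetrize
F = hyperstochastic_3way(K)
print(F.sum(axis=(1, 2)))                # marginal sums become (nearly) equal
```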

Example: multi-body segmentation (model selection). A 9-way array, where each entry contains the probability that a chosen 9-tuple of points arises from the same model.

Example: visual recognition under changing illumination (model selection). A 4-way array, where each entry contains the probability that a chosen set of 4 images lives in a 3D subspace.

Partially-symmetric Decompositions: Latent Clustering

Latent Clustering Model. We are given a collection of data points x_1,...,x_m and a set of context states c_1,...,c_q; the input is triple-wise affinities K_ijc: the probability that x_i and x_j are clustered together given that the context state is c. The affinity of a pair may be small for some context states and high for others, i.e., the vector (K_ij1,...,K_ijq) has low entropy, meaning that the contribution of the context variable to the pairwise affinity is far from uniform; pairwise affinities alone do not convey sufficient information.
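A small numpy sketch of the partially-symmetric structure described here, with hypothetical factors G (point-to-cluster) and F (context-state-to-cluster); the exact form used on the slides may differ:

```python
import numpy as np

rng = np.random.default_rng(12)
m, q, k = 8, 5, 3                                            # points, context states, clusters
G = rng.random((m, k)); G /= G.sum(axis=1, keepdims=True)    # point-to-cluster probabilities
F = rng.random((q, k)); F /= F.sum(axis=0, keepdims=True)    # context-state-to-cluster factors

K = np.einsum('ir,jr,cr->ijc', G, G, F)        # K[i, j, c]: pair affinity given context c
print(np.allclose(K, K.transpose(1, 0, 2)))    # symmetric in the two point indices
```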

Latent Clustering Model. Hypergraph representation G(V, E, w): each hyper-edge is defined over a triplet consisting of two data points and one context state (in any order). Entries over triplets of data points are undetermined, meaning we have no information, i.e., we do not have access to the probability that three data points are clustered together.

Latent Clustering Model. Each cluster couples data points with a subset of context states; the entry associating a context state with a cluster is the probability that the context state is associated with that cluster, and the factorization is a super-symmetric NTF over multiple copies of this block, with all the remaining entries undefined. Hypergraph clustering would provide a means to cluster together points whose pairwise affinities are low but for which there exists a subset of context states for which the conditional affinity is high; in that case, a cluster would be formed containing the data points and the subset of context states.

Latent Clustering Model: Application. A collection of image fragments is gathered from a large set of unlabeled images holding k object classes, one object per image. Available info: the number of object classes k; each image contains only one instance from one (unknown) object class. Clustering task: associate (probabilistically) fragments to object classes. Challenges: clusters and object classes are not the same thing; generally, the number of clusters is larger than the number of object classes, and the same cluster may be shared among a number of different classes. Moreover, the probability that two fragments should belong to the same cluster may be low only because they appear together in a small subset of images.

Image Examples. 3 classes of images: cows, faces, cars.

Examples of Fragments

Examples of Representative Fragments

Leading Fragments of Clusters (clusters 1 through 6)

Examples of the resulting matrix G rows: each row lists a fragment's membership probabilities over the clusters, and typical rows are dominated by one or two clusters (values near 1) with the remaining entries near 0.

Handling Multiple Object Classes. Unlabeled set of segmented images, each containing an instance of some unknown object class from a collection of 10 classes: (1) bottle, (2) can, (3) do-not-enter sign, (4) stop sign, (5) one-way sign, (6) frontal-view car, (7) side-view car, (8) face, (9) computer mouse, (10) pedestrian. Dataset adapted from Torralba, Murphy and Freeman, CVPR 2004.

Clustering Fragments

From Clusters to Classes

Results. A location which is associated with fragments voting consistently for the same object class will have a high value in the corresponding voting map. The strongest hot-spots (local maxima) in the voting maps show the most probable locations of objects.

Results

Applications of NTF: Local Features for Object Class Recognition

Object Class Recognition using Filters. Goal: find a good and efficient collection of filters that captures the essence of an object class of images. Two main approaches: 1) use a large pre-defined bank of filters that is rich in variability and can be efficiently convolved with an image (Viola-Jones, Hel-Or); 2) use filters as basis vectors spanning a low-dimensional vector space which best fits the training images (PCA and HOSVD are holistic, NMF is sparse). The classification algorithm then selects the subset of filters which is most discriminatory.

Object Class Recognition using NTF-Filters. The optimization scheme: 1) construct a bank of filters based on Non-negative Tensor Factorization; 2) consider the filters as weak learners and use AdaBoost. Bank of filters: we perform an NTF with k factors to approximate the set of original images, and an NTF with k factors to approximate the set of inverse images. Every filter is an original/inverse factor pair, and its response is the difference between the two convolutions (original − inverse). An original/inverse pair for face recognition is shown.

Object Class Recognition using NTF-Filters. The optimization scheme: 1) construct a bank of filters based on Non-negative Tensor Factorization; 2) consider the original/inverse pairs as weak learners and use AdaBoost. We ran AdaBoost to construct a classifier of 50 original/inverse NTF pairs. For comparison we ran AdaBoost on other sets of weak learners: a classifier constructed from 50 NMF weak learners, one from 50 PCA weak learners, and one from 200 Viola-Jones weak learners. An NTF filter contains 40 multiplications while an NMF/PCA filter contains about 400 multiplications, yet we obtain comparable results using the same number of filters; a Viola-Jones weak learner is simpler, therefore we used more weak learners in its AdaBoost classifier.
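A toy sketch of this pipeline with random placeholder patches and rank-1 stand-ins for the NTF factors: each original/inverse pair yields one feature (the difference of the two filter responses), and scikit-learn's AdaBoostClassifier with its default depth-1 stumps plays the role of the boosted classifier:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(9)

def rank1_filters(k, size=24, seed=0):
    """k rank-1 'filters' (outer products), standing in for NTF factor images."""
    r = np.random.default_rng(seed)
    return np.stack([np.outer(r.random(size), r.random(size)) for _ in range(k)])

orig_filters = rank1_filters(10, seed=10)      # placeholders for the original-image factors
inv_filters = rank1_filters(10, seed=11)       # placeholders for the inverse-image factors

def responses(patches, orig, inv):
    """One feature per original/inverse pair: <patch, orig> - <patch, inv>."""
    P = patches.reshape(len(patches), -1)
    return P @ orig.reshape(len(orig), -1).T - P @ inv.reshape(len(inv), -1).T

X_img = rng.random((200, 24, 24))              # placeholder 24x24 patches
y = rng.integers(0, 2, 200)                    # placeholder face / non-face labels
X = responses(X_img, orig_filters, inv_filters)

clf = AdaBoostClassifier(n_estimators=50)      # default weak learner: depth-1 decision stump
clf.fit(X, y)
print(clf.score(X, y))                         # training accuracy on the toy data
```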

Face Detection using NTF-Filters. We recovered 100 original factors and 100 inverse factors by performing NTF on 500 faces of size 24x24. The leading filters for NTF, NMF, and PCA are shown.

Face Detection using NTF-Filters. ROC curves: for face detection all the methods achieve comparable performance. 1) Training: AdaBoost was trained on 2500 faces and 4000 non-faces. 2) Test: AdaBoost was tested on the MIT test set, containing 23 images with 149 faces.

Face Detection using NTF-Filters: Results

Pedestrian Detection using NTF-Filters. A sample of the database is shown.

Pedestrian Detection using NTF-Filters. We recovered 100 original factors and 100 inverse factors by performing NTF on 500 pedestrians of size 10x30. The leading filters for NTF, NMF, and PCA are shown.

Pedestrian Detection using NTF-Filters. ROC curves: for pedestrian detection NTF achieves far better performance than the rest. 1) Training: AdaBoost was trained on 4000 pedestrians and 4000 non-pedestrians. 2) Test: AdaBoost was tested on 20 images with 103 pedestrians.

Summary
- Factorization of the empirical joint distribution: Latent Class Model
- Factorization of symmetric forms: probabilistic clustering
- Factorization of partially-symmetric forms: latent clustering
Further details at http://www.cs.huji.ac.il/~shashua

END