Class-specific Sparse Coding for Learning of Object Representations

Stephan Hasler, Heiko Wersing, and Edgar Körner
Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
stephan.hasler@honda-ri.de

Abstract. We present two new methods which extend the traditional sparse coding approach with supervised components. The goal of these extensions is to increase the suitability of the learned features for classification tasks while keeping most of their general representation performance. A special visualization is introduced which allows us to show the principal effect of the new methods. Furthermore, first experimental results are obtained for the COIL-100 database.

1 Introduction

Sparse coding [4] searches for a linear code representing the data. Its target is to combine efficient reconstruction with a sparse usage of the representing basis, resulting in the following cost function:

    P_S = \frac{1}{2} \sum_i \Big\| x_i - \sum_j c_{ij} w_j \Big\|^2 + \gamma \sum_{i,j} \Phi(c_{ij}) .   (1)

The left reconstruction term approximates each input x_i = (x_i1, x_i2, ..., x_ik)^T by a linear combination r_i = Σ_j c_ij w_j of the weights w_j = (w_j1, w_j2, ..., w_jk)^T, where r_i is called the reconstruction of the corresponding x_i. The coefficients c_ij specify how much the j-th weight is involved in the reconstruction of the i-th data vector. The right sparsity term sums up the Φ(c_ij). The nonlinear function Φ (e.g. Φ(c) = |c|) increases the costs the more the activation is spread over different c_ij, so that many of them become zero while few are highly activated. The influence of the sparsity term is scaled with the positive constant γ.

An adaptation of sparse coding is nonnegative sparse coding [7]. In this approach the coefficients and the elements of the weights are kept positive. This forces the weights to become more distinct and produces a parts-based representation similar to that obtained by nonnegative matrix factorization [2] with sparseness constraints [1].

In Sect. 2 we introduce two new class-specific extensions of the nonnegative sparse coding. We visualize their principal effect with a simple 2D example to analyze the influence of certain parameters of the cost functions. Furthermore, first experimental results are obtained for the COIL-100 database [3] in Sect. 3. Section 4 gives a short conclusion of the results.
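To make the notation concrete, the following is a minimal numpy sketch of the cost in Eq. (1). The array layout (one data vector x_i per row of X, one weight w_j per row of W, coefficients c_ij in C) and the function name are illustrative assumptions, not a reproduction of the authors' implementation; Φ is taken as the absolute value, as in the text.

```python
import numpy as np

def sparse_coding_cost(X, C, W, gamma):
    """Reconstruction plus gamma-weighted sparsity term of Eq. (1).

    X: (N, K) data vectors x_i as rows, W: (M, K) weights w_j as rows,
    C: (N, M) coefficients c_ij, gamma: positive sparsity constant.
    """
    R = C @ W                                    # reconstructions r_i = sum_j c_ij w_j
    reconstruction = 0.5 * np.sum((X - R) ** 2)  # 1/2 sum_i ||x_i - r_i||^2
    sparsity = np.sum(np.abs(C))                 # sum_{i,j} Phi(c_ij) with Phi(c) = |c|
    return reconstruction + gamma * sparsity
```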

Fig. 1. a) Schematic description of the visualization. The positions of the data vectors, their reconstructions, and the weights are plotted for different values of a control parameter of the cost function, e.g. γ = γ_min ... γ_max. The data vectors x_i lie on the unit circle. The shown w̃_j = d·w_j are the weights scaled by a factor d = (γ_max − γ)/(γ_max − γ_min). Similarly, the r̃_i = d·r_i = d·Σ_j c_ij w_j are the reconstructions scaled by the same factor. The scaling causes the r̃_i and the w̃_j to move towards the origin with increasing γ; this is simply to make it easier to distinguish the points belonging to different values of γ (see b). Because the weights w_j are normalized, the distance of the w̃_j from the origin equals the scaling factor d. The distance of the r̃_i from the origin is also influenced by the cost function itself; in the nonnegative case it is always shortened, since low values of the basis coefficients c_ij are enforced. From the positions of the r̃_i and the w̃_j it is possible to determine the coefficients c_ij and hence to judge the sparsity of the reconstructions of certain x_i. For example, x_2 is reconstructed sparsely, because it only uses w_2. In the visualization the linear combinations are not plotted, and there is one point for each r̃_i and for w̃_1, w̃_2 for each value of γ. The points belonging to the same value of γ can be determined by counting, preferably from the unit circle towards the origin. b) Visualization of the influence of the parameter γ on the nonnegative sparse coding. Two scaled weights and the scaled reconstructions of 10 data vectors are plotted for 31 different influences γ of the sparsity term. The scaling factor is d = (γ_max − γ)/(γ_max − γ_min). The reconstructions belonging to successive values of γ are connected. For γ = γ_min ≈ 0 the algorithm searches for the sparsest perfect reconstruction, and since d = 1 in this case the r̃_i lie directly on the x_i. The corresponding w_j are aligned with the outermost x_i. With increasing γ the points move towards the origin. Each r̃_i gives up the use of the less suitable weight, and therefore the r̃_i unite into two main paths. Each w_j aligns to the center of the r̃_i which are assigned to it.
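For concreteness, here is a minimal sketch of the scaling used in the visualization of Fig. 1, assuming the same array conventions as above; the function name is an illustrative choice, not from the paper.

```python
import numpy as np

def scaled_points(W, C, gamma, gamma_min, gamma_max):
    """Return the scaled weights w~_j = d*w_j and reconstructions r~_i = d*r_i.

    The factor d shrinks the points towards the origin as gamma grows,
    so that points belonging to different gamma values can be told apart.
    """
    d = (gamma_max - gamma) / (gamma_max - gamma_min)
    W_scaled = d * W          # w~_j
    R_scaled = d * (C @ W)    # r~_i = d * sum_j c_ij w_j
    return W_scaled, R_scaled
```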

2 Class-specific Sparse Coding

The sparse coding features are useful for general image representation but lack the property of being class-specific, so their use in classification tasks is limited. Our two new approaches extend the nonnegative sparse coding with supervised components.

In the first approach the class information has a direct effect on the coefficients; it will therefore be referred to as coefficient coding:

    P_C = \frac{1}{2} \sum_i \Big\| x_i - \sum_j c_{ij} w_j \Big\|^2 + \gamma \sum_{i,j} c_{ij} + \frac{\alpha}{2} \sum_j \sum_{i,\tilde{i} : Q_i \neq Q_{\tilde{i}}} c_{ij} c_{\tilde{i}j} .   (2)

The right coefficient term causes costs if coefficients belonging to the same weight w_j are active for representatives x_i and x_ĩ of different classes Q_i and Q_ĩ, respectively. Q_i stands for the class of a data vector x_i. The influence of the coefficient term is scaled with the positive constant α.

In the second approach the class information has a more direct effect on the weights; it will therefore be referred to as weight coding:

    P_W = \frac{1}{2} \sum_i \Big\| x_i - \sum_j c_{ij} w_j \Big\|^2 + \gamma \sum_{i,j} c_{ij} + \frac{\beta}{2} \sum_j \sum_{i,\tilde{i} : Q_i \neq Q_{\tilde{i}}} ( x_i^T w_j ) ( x_{\tilde{i}}^T w_j ) .   (3)

The right weight term is similar to a linear discriminator and causes costs if a w_j has a high inner product with representatives x_i and x_ĩ of different classes Q_i and Q_ĩ, respectively. The weight term is scaled with the positive constant β.

The minimization of both cost functions is done by alternately applying coefficient and weight steps as described in [7]. In the coefficient step the cost function is minimized with respect to the c_ij using an asynchronous fixed-point search, while keeping the w_j constant. The weight step is a single gradient step with a fixed step size in the w_j, keeping the c_ij constant. In Fig. 1 we introduce our special visualization schematically and apply it to the nonnegative sparse coding; in Fig. 2 it is used to compare coefficient and weight coding.

The coefficient coding restricts the use of features across different classes. This means that each feature concentrates on a single class, so the influence of other classes is weakened. There is no increase in discriminative quality, because this would require a strong interaction of different classes. The weight coding instead has a more direct effect on the features and removes activation from them that is present in different classes. Thus the features represent more typical aspects of certain objects, and their suitability for object representation and recognition is increased. The cost function shows similarity to Fisher's linear discriminant and the MRDF approach [5] but does not minimize the intra-class variance. The advantage of the weight coding is that it can produce an overcomplete representation, while the number of features in the Fisher linear discriminant is limited by the number of classes and in the MRDF by the number of dimensions in the data. The two parameters have to be chosen carefully.
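As a reading aid, the two class-specific penalty terms of Eqs. (2) and (3) can be written down compactly in numpy. This is only a sketch under assumed conventions (X holds one data vector per row, C the coefficients, W one weight per row, q the integer class labels Q_i); it is not the authors' implementation, and the actual minimization (asynchronous fixed-point search for the c_ij, gradient steps for the w_j, as described above) is not reproduced here.

```python
import numpy as np

def coefficient_term(C, q):
    """(1/2) sum_j sum_{i,ĩ: Q_i != Q_ĩ} c_ij c_ĩj, i.e. the term scaled by alpha in Eq. (2)."""
    diff = (q[:, None] != q[None, :]).astype(float)   # 1 where the two samples belong to different classes
    # sum over all ordered pairs (i, ĩ) with different classes and over all weights j
    return 0.5 * np.einsum('ij,ik,kj->', C, diff, C)

def weight_term(X, W, q):
    """(1/2) sum_j sum_{i,ĩ: Q_i != Q_ĩ} (x_i^T w_j)(x_ĩ^T w_j), i.e. the term scaled by beta in Eq. (3)."""
    A = X @ W.T                                        # A[i, j] = x_i^T w_j
    diff = (q[:, None] != q[None, :]).astype(float)
    return 0.5 * np.einsum('ij,ik,kj->', A, diff, A)
```

Both terms vanish when no feature is used (coefficient coding) or responds (weight coding) across class boundaries, which is exactly the behaviour the penalties are meant to enforce.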

Fig. 2. Visualization of the influence of the parameters α and β on the coefficient and weight coding. The 10 data vectors are now assigned to 2 classes and again are reconstructed using two weights. The influence of the sparsity term is the same constant value γ = 0.05 in both cases. a) In the coefficient coding the influence α of the coefficient term varies over a defined range. The scaling is d = (α_max − α)/(α_max − α_min). With increasing α the points are pulled towards the origin. Each weight is forced to specialize on a certain class and therefore moves to the center of this class. There is no gain in discriminative power. b) In the weight coding the influence β of the weight term varies over a defined range. The scaling is d = (β_max − β)/(β_max − β_min). With increasing β the points are pulled towards the origin. The suitability of the weights for different classes is reduced and each aligns to activation which is most typical for the class it is representing; so one weight moves towards the x-axis and the other one towards the y-axis. In the nonnegative case this can be regarded as a gain in discriminative power. When the influence of the sparsity term is too weak, many features tend to represent the same activation.

3 Experimental Results

To underpin the qualitative difference between coefficient coding and weight coding, both approaches have been applied to a more complex problem. Nine car objects and nine box objects from the COIL-100 database [3], each with 72 rotation views, were combined into two classes (see Fig. 3). Fifty features were trained using the same influence γ = 0.1 of the sparsity term and relatively high values for the influence α = 1·10^−3 of the coefficient term and the influence β = 5·10^−7 of the weight term. The resulting features are shown in Fig. 4. Table 1 lists the values of the terms of the cost functions after optimization. These values are useful to interpret the effect of our two new approaches compared to the nonnegative sparse coding.

Fig. 3. Two-class problem. Nine cars and nine boxes were combined into two classes.

Since the coefficient term puts a penalty on the use of features across different classes, a splitting into partial class problems is observed, leading to a reduced feature basis for each class. As a result, there is an increase of the reconstruction costs and a decrease of the sparsity costs. The demand for sparsity of the coefficients in the nonnegative sparse coding has an opposite effect on the weights, forcing them to become very view-specific and leading to a higher reconstruction cost. In the weight coding the weight term removes activation from the features. They become less view-specific, which causes a decrease of the reconstruction costs, but an increase of the sparsity costs.

Table 1. Values of the different terms of the cost functions after optimization, neglecting their actual influences (γ, α, and β). Values in brackets were not used for optimization, but are shown to highlight qualitative differences between the features.

                           Reconstruction   Sparsity       Coefficient     Weight
    Nonneg. Sparse Coding  4.182·10^4       4.293·10^4     (1.726·10^7)    (1.428·10^10)
    Coefficient Coding     4.682·10^4       3.872·10^4     5.709·10^5      (1.731·10^10)
    Weight Coding          4.031·10^4       4.976·10^4     (2.393·10^7)    1.035·10^10

4 Conclusion

In this paper two new class-specific extensions of the nonnegative sparse coding were introduced. It was shown that the coefficient coding, by restricting the use of features across different classes, does not increase the discriminative quality of the features, but instead tends to cause a splitting into partial problems, using a distinct feature basis for representing each class. In contrast, the weight coding directly penalizes the suitability of features for different classes and so successfully combines representative and discriminative properties. This combination produces features which are more suitable for object representation than features with general representative quality only. The advantage of the weight coding is that it can produce an overcomplete representation, whereas most other approaches use the covariance matrix directly, so that the number of features is limited by the number of dimensions in the data. The drawback is that two parameters have to be tuned suitably. Also, the intra-class variance is not reduced as, e.g., in the MRDF approach. The evaluation of the usefulness of the weight coding features in object recognition will be the subject of further investigations.

Fig. 4. Features for the three approaches (nonnegative sparse coding, coefficient coding, weight coding). For the visualization, the features are arranged in the following way: each feature is assigned to the class in which it is most often detected. A feature is detected if its normalized cross-correlation with an image exceeds a feature-specific threshold. This threshold was determined so as to maximize the mutual information conveyed by the detection of the feature about the classes (see [6]). The more car-like features start at the top-left and the more box-like features at the bottom-right. The features within one class are arranged by descending mutual information; therefore the least informative features of the two classes meet somewhere in the middle. The features of the nonnegative sparse coding and the coefficient coding are very similar to each other. The features of the weight coding are sparser and concentrate on more typical class attributes, like certain parts of the cars or the vertices of the boxes.

References

1. Hoyer, P.: Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5 (2004) 1457-1469
2. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999) 788-791
3. Nayar, S.K., Nene, S.A., Murase, H.: Real-time 100 object recognition system. In Proc. IEEE Conference on Robotics and Automation 3 (1996) 2321-2325
4. Olshausen, B., Field, D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (1996) 607-609
5. Talukder, A., Casasent, D.: Classification and Pose Estimation of Objects using Nonlinear Features. In Proc. SPIE: Applications and Science of Computational Intelligence 3390 (1998) 12-23
6. Ullman, S., Bart, E.: Recognition invariance obtained by extended and invariant features. Neural Networks 17(1) (2004) 833-848
7. Wersing, H., Körner, E.: Learning Optimized Features for Hierarchical Models of Invariant Object Recognition. Neural Computation 15(7) (2003) 1559-1588
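The detection and class-assignment procedure described in the caption of Fig. 4 can be illustrated with a short sketch. The function names, the exhaustive threshold search over the observed response values, and the input format (per-image normalized cross-correlation responses of a single feature plus integer class labels) are illustrative assumptions; the paper only states that the feature-specific threshold maximizes the mutual information between detection and class (see [6]).

```python
import numpy as np

def mutual_information(detected, labels):
    """Mutual information (in nats) between a binary detection variable and class labels."""
    mi = 0.0
    for d in (0, 1):
        for c in np.unique(labels):
            p_dc = np.mean((detected == d) & (labels == c))   # joint probability p(d, c)
            if p_dc > 0:
                p_d = np.mean(detected == d)
                p_c = np.mean(labels == c)
                mi += p_dc * np.log(p_dc / (p_d * p_c))
    return mi

def assign_feature_to_class(ncc_responses, labels):
    """ncc_responses: (N,) normalized cross-correlations of one feature with each image.

    labels: (N,) integer class labels. Returns the class to which the feature is
    assigned, i.e. the class in which it is most often detected, or None if it is
    never detected with the chosen threshold.
    """
    # pick the feature-specific threshold maximizing the mutual information
    best_mi, best_thr = -np.inf, None
    for thr in np.unique(ncc_responses):
        mi = mutual_information((ncc_responses > thr).astype(int), labels)
        if mi > best_mi:
            best_mi, best_thr = mi, thr
    detected = ncc_responses > best_thr
    classes, counts = np.unique(labels[detected], return_counts=True)
    return classes[np.argmax(counts)] if len(classes) else None
```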