Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation


Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek & Cordelia Schmid LEAR Team, INRIA Rhône-Alpes Grenoble, France

Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation. Goal: predict relevant keywords for images. Approach: generalize from a database of annotated images. Application 1: image annotation, i.e. propose a list of relevant keywords to assist a human annotator. Application 2: keyword-based image search, i.e. given one or more keywords, propose a list of relevant images.

Examples of Image Annotation. True keywords: glacier, mountain, people, tourist. Predicted (confidence): glacier (1.00), mountain (1.00), front (0.64), sky (0.58), people (0.58).

Examples of Image Annotation. True keywords: landscape, lot, meadow, water. Predicted (confidence): llama (1.00), water (1.00), landscape (1.00), front (0.60), people (0.51).

Examples of Keyword Based Retrieval Query: water, pool Relevant images: 10 Correct: 9 among top 10

Examples of Keyword Based Retrieval Query: beach, sand Relevant images: 8 Correct: 2 among top 8

Presentation Outline 1. Related work 2. Metric learning for nearest neighbors 3. Data sets & Feature extraction 4. Results 5. Conclusion & outlook

Related Work: Latent Topic Models. Inspired by text-analysis models: Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. Generative model over keywords and image regions, trained to model both text and image; condition on the image to predict text. Trade-off: overfitting vs. capacity limited by the number of topics. [Barnard et al., Matching words and pictures, JMLR 03]

Related Work: Mixture model approaches. Generative model over keywords and the image: a kernel density estimate (KDE) over image features gives posterior weights for the training images, and these weights are used to average the training annotations. Non-parametric model: only the KDE bandwidth needs to be set. [Feng et al., Multiple Bernoulli relevance models, CVPR 04]

Related Work: Parallel Binary Classifiers. Learn a binary classifier per keyword. Drawbacks: many classifiers to learn, no parameter sharing between keywords, and large class imbalances (1% positive data per class is no exception). [Grangier & Bengio, A discriminative kernel-based model to rank images from text queries, PAMI 08]

Related Work: Local learning approaches. Use the most similar images to predict keywords: diffusion of labels over a similarity graph, or nearest-neighbor classification. State-of-the-art image annotation results. Key question: what distance should be used to define neighbors? [Makadia et al., A new baseline for image annotation, ECCV 08]

Presentation Outline 1. Related work 2. Metric learning for nearest neighbors 3. Data sets & Feature extraction 4. Results 5. Conclusion & outlook

A predictive model for keyword absence/presence. Given: relevance of keywords w for images i, $y_{iw} \in \{-1, +1\}$, $i \in \{1, \dots, I\}$, $w \in \{1, \dots, W\}$. Given: visual dissimilarity between images, $d_{ij} \ge 0$, $i, j \in \{1, \dots, I\}$. Objective: optimally predict the annotations $y_{iw}$.

A predictive model for keyword absence/presence. $\pi_{ij}$: weight of training image j for predictions for image i. Weights are defined through dissimilarities, with $\pi_{ij} \ge 0$ and $\sum_j \pi_{ij} = 1$. Prediction: $p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j)$ (1), with $p(y_{iw} = +1 \mid j) = 1 - \epsilon$ for $y_{jw} = +1$ and $\epsilon$ otherwise (2).

A predictive model for keyword absence/presence. Parameters: the definition of the $\pi_{ij}$ from visual similarities, $p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j)$. Learning objective: maximize the probability of the actual annotations, $L = \sum_i \sum_w c_{iw} \ln p(y_{iw})$ (3). Annotation costs $c_{iw}$: absences are much noisier, so emphasise the prediction of keyword presences.
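As a concrete illustration, here is a minimal NumPy sketch of eqs. (1)-(3); the array names, toy sizes, and the cost values for absences are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the weighted nearest-neighbor prediction model (eqs. 1-3).
import numpy as np

rng = np.random.default_rng(0)
I, W = 6, 4                      # number of images and keywords (toy sizes)
eps = 1e-5                       # epsilon in eq. (2)

Y = rng.choice([-1, +1], size=(I, W))          # ground-truth keyword relevance y_iw
pi = rng.random((I, I))                        # neighbor weights pi_ij (toy values)
np.fill_diagonal(pi, 0.0)                      # an image does not predict itself
pi /= pi.sum(axis=1, keepdims=True)            # enforce sum_j pi_ij = 1

# eq. (2): neighbor j predicts presence with prob 1-eps if it has the keyword
p_given_j = np.where(Y == +1, 1.0 - eps, eps)  # row j gives p(y_iw = +1 | j)

# eq. (1): p(y_iw = +1) = sum_j pi_ij * p(y_iw = +1 | j)
p_presence = pi @ p_given_j                    # shape (I, W)

# eq. (3): cost-weighted log-likelihood; absences get a lower cost c_iw
# because absent keywords are noisier (these cost values are an assumption).
c = np.where(Y == +1, 1.0, 0.2)
p_of_truth = np.where(Y == +1, p_presence, 1.0 - p_presence)
L = np.sum(c * np.log(p_of_truth))
print("log-likelihood:", L)
```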

Example-absences: examples of typical annotations Actual: wave (0.99), girl (0.99), flower (0.97), black (0.93), america (0.11) Predicted: people (1.00), woman (1.00), wave (0.99), group (0.99), girl (0.99) Actual: drawing (1.00), cartoon (1.00), kid (0.75), dog (0.72), brown (0.54) Predicted: drawing (1.00), cartoon (1.00), child (0.96), red (0.94), white (0.89)

Rank-based weighting of neighbors. Weight given by rank: the k-th neighbor always receives the same weight. If j is the k-th nearest neighbor of i, then $\pi_{ij} = \gamma_k$ (4). Optimization: L is concave with respect to $\{\gamma_k\}$; use an Expectation-Maximization algorithm or projected gradient descent on $p(y_{iw} = +1) = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j)$ and $L = \sum_{i,w} c_{iw} \ln p(y_{iw})$.

Rank-based weighting of neighbors. Effective neighborhood size is set automatically. [Figure: learned weight as a function of neighbor rank (y-axis: weight, 0 to 0.25; x-axis: neighbor rank, 0 to 20); the weights decay towards zero with increasing rank.]
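To make eq. (4) concrete, here is a small NumPy sketch of rank-based weights; the gamma values below are hand-set for illustration, whereas in the model they are learned (EM or projected gradient).

```python
# Sketch of rank-based weights (eq. 4): the k-th nearest neighbor of every
# image receives the same weight gamma_k.
import numpy as np

def rank_based_pi(dist, gamma):
    """dist: (I, I) pairwise distances; gamma: (K,) non-negative weights summing to 1."""
    I, K = dist.shape[0], len(gamma)
    pi = np.zeros((I, I))
    for i in range(I):
        order = np.argsort(dist[i])
        order = order[order != i][:K]        # K nearest neighbors, excluding i itself
        pi[i, order] = gamma                 # the weight depends only on the rank
    return pi

dist = np.abs(np.subtract.outer(np.arange(6.0), np.arange(6.0)))  # toy distances
gamma = np.array([0.4, 0.3, 0.2, 0.1])                            # decaying rank weights
print(rank_based_pi(dist, gamma))
```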

Distance-based weighting of neighbors. Weight given by distance, with $d_{ij}$ the visual distance between images: $\pi_{ij} = \frac{\exp(-\lambda d_{ij})}{\sum_k \exp(-\lambda d_{ik})}$ (5). A single parameter $\lambda$ controls how the weight decays with distance, and the weights are a smooth function of the distances. Optimization: gradient descent, with $\frac{\partial L}{\partial \lambda} = W \sum_{i,j} (\pi_{ij} - \rho_{ij})\, d_{ij}$ (6) and $\rho_{ij} = \frac{1}{W} \sum_w p(j \mid y_{iw}) = \frac{1}{W} \sum_w \frac{\pi_{ij}\, p(y_{iw} \mid j)}{\sum_k \pi_{ik}\, p(y_{iw} \mid k)}$ (7).
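A minimal sketch of the distance-based weights of eq. (5), assuming a precomputed pairwise distance matrix; the decay parameter lambda is fixed here rather than learned by gradient descent as in the model.

```python
# Sketch of distance-based weights (eq. 5): a softmax over -lambda * d_ij.
import numpy as np

def distance_based_pi(dist, lam):
    """dist: (I, I) pairwise visual distances; lam: weight-decay parameter."""
    logits = -lam * dist
    np.fill_diagonal(logits, -np.inf)             # exclude the image itself
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)       # rows sum to 1

dist = np.abs(np.subtract.outer(np.arange(5.0), np.arange(5.0)))  # toy distances
print(distance_based_pi(dist, lam=1.5))
```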

Metric learning for nearest neighbors. What is an appropriate distance to define neighbors? Which image features to use, and what distance over these features? A linear distance combination defines the weights: $\pi_{ij} = \frac{\exp(-\mathbf{w}^\top \mathbf{d}_{ij})}{\sum_k \exp(-\mathbf{w}^\top \mathbf{d}_{ik})}$ (8), where $\mathbf{d}_{ij}$ is the vector of base distances between images i and j. Learn the distance combination by maximizing the annotation log-likelihood as before, with one parameter per base distance.
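A sketch of eq. (8), assuming the D base distances are stacked into a (D, I, I) array; the weight vector w is hand-set here for illustration, whereas the model learns it by maximizing the annotation log-likelihood.

```python
# Sketch of the metric-learning weights (eq. 8): a softmax over the negative
# inner product of a weight vector w with the vector of D base distances d_ij.
import numpy as np

def combined_pi(base_dists, w):
    """base_dists: (D, I, I) stack of base distance matrices; w: (D,) weights."""
    combined = np.tensordot(w, base_dists, axes=1)   # w^T d_ij for every pair (i, j)
    np.fill_diagonal(combined, np.inf)               # exclude the image itself
    logits = -combined
    logits -= logits.max(axis=1, keepdims=True)      # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
base_dists = rng.random((3, 5, 5))                   # e.g. color, SIFT, GIST distances (toy)
w = np.array([0.5, 1.0, 0.2])                        # one parameter per base distance
print(combined_pi(base_dists, w))
```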

A predictive model for keyword absence/presence

Low recall of rare words. Let us annotate images with the 5 most likely keywords. Recall for a keyword is defined as the number of images annotated with the keyword divided by the number of images truly having the keyword. Keywords with low frequency in the database have low recall: neighbors that have the keyword do not account for enough mass, so presence probabilities are systematically lower. Need to boost the presence probability at some point.

Sigmoidal modulation of predictions. Prediction of the weighted nearest neighbor model: $x_{iw} = \sum_j \pi_{ij}\, p(y_{iw} = +1 \mid j)$ (9). A word-specific logistic discriminant model allows boosting the probability beyond a threshold value and adjusts the dynamic range per word: $p(y_{iw} = +1) = \sigma(\alpha_w x_{iw} + \beta_w)$ (10), with $\sigma(z) = 1 / (1 + \exp(-z))$ (11). Train the model with gradient descent in an iterative manner: optimize $(\alpha_w, \beta_w)$ for all words (a convex problem), then optimize the neighbor weights $\pi_{ij}$ through their parameters.
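A small sketch of eqs. (9)-(11) with toy values: the per-word slope and offset are made up here, whereas in the model they are fitted by (convex) logistic regression.

```python
# Sketch of the word-specific sigmoidal modulation: the weighted nearest-
# neighbor score x_iw is passed through a per-word logistic function, which
# can boost the presence probability of rare words.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # eq. (11)

I, W = 4, 3
rng = np.random.default_rng(2)
x = rng.random((I, W))                          # x_iw = sum_j pi_ij p(y_iw = +1 | j), eq. (9)
alpha = np.array([8.0, 5.0, 12.0])              # per-word slope alpha_w (toy values)
beta = np.array([-2.0, -1.0, -6.0])             # per-word offset beta_w (toy values)

p = sigmoid(alpha * x + beta)                   # eq. (10)
print(p)
```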

Training the model in practice. $p(y_{iw}) = \sum_j \pi_{ij}\, p(y_{iw} \mid j)$, $L = \sum_{i,w} c_{iw} \ln p(y_{iw})$. Computing L and its gradient is quadratic in the number of images, so use a limited set of k neighbors for each image i. We don't know the distance combination in advance, so include as many neighbors from each base distance as possible; due to the overlap of the neighborhoods, approximately 2k/D neighbors from each distance can be included.
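One plausible reading of this neighbor pre-selection, sketched in NumPy: take the union of the k nearest neighbors under each of the D base distances as the candidate set for each image. This is an interpretation offered for illustration, not the authors' exact procedure.

```python
# Sketch of neighbor pre-selection: since the learned distance combination is
# unknown in advance, gather candidates from every base distance.
import numpy as np

def candidate_neighbors(base_dists, k):
    """base_dists: (D, I, I); returns a list of candidate neighbor indices per image."""
    D, I, _ = base_dists.shape
    candidates = []
    for i in range(I):
        s = set()
        for d in range(D):
            order = np.argsort(base_dists[d, i])
            s.update(j for j in order[:k + 1] if j != i)   # skip the image itself
        candidates.append(sorted(s))
    return candidates

rng = np.random.default_rng(3)
base_dists = rng.random((3, 8, 8))     # toy: 3 base distances over 8 images
print(candidate_neighbors(base_dists, k=3))
```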

Presentation Outline 1. Related work 2. Metric learning for nearest neighbors 3. Data sets & Feature extraction 4. Results 5. Conclusion & outlook

Data set 1: Corel 5k. 5,000 images: landscapes, animals, cities, ... 3 words on average (max. 5) per image; vocabulary of 260 words. Annotations designed for retrieval.

Data set 2: ESP Game. 20,000 images: photos, drawings, graphs, ... 5 words on average (max. 15) per image; vocabulary of 268 words. Annotations generated by players of an on-line game.

Data set 2: ESP Game. Annotations generated by players of an on-line game: both players see the same image but cannot communicate, and gain points by typing the same keyword.

Data set 3: IAPR TC-12. 20,000 images: touristic photos, sports. 6 words on average (max. 23) per image; vocabulary of 291 words. Annotations obtained from descriptive text, with nouns extracted using natural language processing.

Feature extraction. Collection of 15 representations. Color features: global histograms in the HSV, LAB and RGB color spaces, each channel quantized into 16 levels. Local SIFT features [Lowe 04]: extracted on a dense multi-scale grid and at interest points, k-means quantization into 1,000 visual words. Local Hue features [van de Weijer & Schmid 06]: extracted on a dense multi-scale grid and at interest points, k-means quantization into 100 visual words. Global GIST features [Oliva & Torralba 01]. Spatial 3x1 partitioning [Lazebnik et al. 06]: concatenate the histograms from the regions; done for all features except GIST.
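As an illustration of the color part of this pipeline, here is a sketch of a global per-channel histogram with 16 levels and a 3x1 spatial partition (assumed here to be three horizontal bands); function names are made up and this is not the authors' code.

```python
# Sketch of a global color descriptor: per-channel histograms with 16 levels,
# computed over a 3x1 spatial partition and concatenated.
import numpy as np

def color_histogram(img, bins=16):
    """img: (H, W, 3) uint8 image; returns a (3 * bins,) L1-normalized histogram."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def spatial_3x1_histogram(img, bins=16):
    """Split the image into 3 horizontal bands and concatenate their histograms."""
    bands = np.array_split(img, 3, axis=0)
    return np.concatenate([color_histogram(b, bins) for b in bands])

img = (np.arange(60 * 40 * 3) % 256).astype(np.uint8).reshape(60, 40, 3)   # toy image
print(spatial_3x1_histogram(img).shape)   # 3 bands x 3 channels x 16 bins = (144,)
```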

Presentation Outline 1. Related work 2. Metric learning for nearest neighbors 3. Data sets & Feature extraction 4. Results 5. Conclusion & outlook

Evaluation Measures. Measures are computed per keyword, then averaged. Annotation: annotate each image with its 5 most likely keywords. Recall: number of images correctly annotated with a keyword / number of images having the keyword in the ground truth. Precision: number of images correctly annotated / number of images annotated. N+: number of keywords with non-zero recall. Direct retrieval measures: rank all images according to a given keyword's presence probability, compute precision at all positions in the list (from 1 up to N), and report the Average Precision over the positions of correct images.
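The following sketch shows how these measures can be computed from a matrix of presence probabilities and binary ground truth; all names and the toy data are illustrative, not the evaluation code used by the authors.

```python
# Sketch of the evaluation measures: annotate each image with its 5 most
# likely keywords, then compute per-keyword precision/recall, N+, and MAP.
import numpy as np

def annotation_scores(prob, truth, top=5):
    """prob, truth: (I, W) arrays; truth is binary ground-truth relevance."""
    I, W = prob.shape
    pred = np.zeros_like(truth)
    top_idx = np.argsort(-prob, axis=1)[:, :top]
    pred[np.arange(I)[:, None], top_idx] = 1                     # 5 keywords per image
    tp = (pred * truth).sum(axis=0).astype(float)
    n_true, n_pred = truth.sum(axis=0), pred.sum(axis=0)
    recall = np.divide(tp, n_true, out=np.zeros(W), where=n_true > 0)
    precision = np.divide(tp, n_pred, out=np.zeros(W), where=n_pred > 0)
    return precision.mean(), recall.mean(), int((recall > 0).sum())   # P, R, N+

def mean_average_precision(prob, truth):
    """Average precision per keyword (ranking images by presence probability)."""
    aps = []
    for w in range(prob.shape[1]):
        order = np.argsort(-prob[:, w])
        rel = truth[order, w]
        if rel.sum() == 0:
            continue
        hits = np.cumsum(rel)
        prec_at_hits = hits[rel == 1] / (np.nonzero(rel)[0] + 1)  # precision at correct images
        aps.append(prec_at_hits.mean())
    return float(np.mean(aps))

rng = np.random.default_rng(4)
prob = rng.random((20, 8))
truth = (rng.random((20, 8)) < 0.3).astype(int)
print(annotation_scores(prob, truth), mean_average_precision(prob, truth))
```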

Results Corel - Annotation

Figure 2 (from the paper): performance on the Corel data set of the distance-based models in terms of P_µ, R_µ, BEP, and MAP as a function of the number of neighbours K, for the WN model (dashed) and the σWN model (solid); red curves correspond to a learned distance combination, blue ones to equal weighting of the distances.

Table 2 (from the paper): performance in terms of P_µ, R_µ, and N+ of our models (using K = 200) and results reported in earlier work. JEC-15 refers to our implementation of [14] using our 15 distances. WN and σWN use rank- or distance-based weights with the equal distance combination; WN-ML and σWN-ML integrate metric learning.

Method                 P_µ   R_µ   N+
CRM [10]               16    19    107
InfNet [15]            17    24    112
NPDE [22]              18    21    114
SML [2]                23    29    137
MBRM [5]               24    25    122
TGLM [13]              25    29    131
JEC [14]               27    32    139
JEC-15                 28    33    140
WN (rank-based)        28    32    136
σWN (rank-based)       26    34    143
WN (distance-based)    30    33    136
σWN (distance-based)   28    35    145
WN-ML                  31    37    146
σWN-ML                 33    42    160

Conclusions for the Corel 5k data set: with the tag-transfer method of [14] and our features (JEC-15) we obtain results similar to the original work; rank-based and distance-based weights are comparable; metric learning improves results considerably, with the learned distance combinations outperforming the equal-weighted combination for both WN and σWN over all numbers of neighbours; the sigmoid consistently improves recall (R_µ) and also helps the other measures.

Results Corel - Retrieval. Mean Average Precision: roughly +10% overall. Metric learning improves results considerably. The sigmoid has only a small effect, since retrieval is evaluated per word.

Results ESP & IAPR - Annotation

Table 3 (from the paper): performance in terms of P_µ, R_µ, and N+ on IAPR and ESP using our models with distance-based weights; -ML indicates models that integrate metric learning.

            IAPR              ESP Game
            P_µ  R_µ  N+      P_µ  R_µ  N+
MBRM [5]    24   23   223     18   19   209
JEC [14]    28   29   250     22   25   224
JEC-15      29   19   211     24   19   222
WN          50   20   215     48   19   212
σWN         41   30   259     39   24   232
WN-ML       48   25   227     49   20   213
σWN-ML      46   35   266     39   27   239

Metric learning improves results considerably. The sigmoid trades precision for recall.

Detailed view of the effect of the sigmoid. Mean recall of words, with keywords binned by how many images they occur in: WN-ML (blue) and σWN-ML (yellow). The lower bars show the number of keywords in each bin, the upper bars the average recall for the words in each bin.
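A sketch of the binning analysis behind this slide: group keywords by the number of images they occur in and average recall per bin. The bin edges and toy inputs are assumptions made for illustration.

```python
# Sketch of per-frequency recall analysis: bin keywords by how many images
# they occur in, then report the count and mean recall per bin.
import numpy as np

def recall_by_frequency(recall, keyword_freq, edges=(0, 5, 10, 20, 50, 100, np.inf)):
    """recall, keyword_freq: (W,) arrays; returns (count, mean recall) per bin."""
    edges = np.asarray(edges)
    bins = np.digitize(keyword_freq, edges[1:-1])
    out = []
    for b in range(len(edges) - 1):
        mask = bins == b
        out.append((int(mask.sum()), float(recall[mask].mean()) if mask.any() else 0.0))
    return out

rng = np.random.default_rng(5)
freq = rng.integers(1, 200, size=50)   # toy keyword frequencies
rec = rng.random(50)                   # toy per-keyword recall values
print(recall_by_frequency(rec, freq))
```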

Examples - Corel Retrieval. Per-query precision in % (number of relevant images in parentheses): tiger: 100.00 (10); garden: 60.00 (10); town: 22.22 (9); water, pool: 90.00 (10); beach, sand: 25.00 (8).

Examples - Corel Annotation. BEP: 100% Ground Truth: sun (1.00), sky (1.00), tree (1.00), clouds (0.99) Predictions: sun (1.00), sky (1.00), tree (1.00), clouds (0.99) BEP: 100% Ground Truth: mosque (1.00), temple (1.00), stone (1.00), pillar (1.00) Predictions: mosque (1.00), temple (1.00), stone (1.00), pillar (1.00) BEP: 50% Ground Truth: grass (0.98), tree (0.98), bush (0.54), truck (0.05) Predictions: flowers (1.00), grass (0.98), tree (0.98), moose (0.95) BEP: 50% Ground Truth: herd (0.99), grass (0.98), tundra (0.96), caribou (0.13) Predictions: sky (0.99), herd (0.99), grass (0.98), hills (0.97) BEP: 50% Ground Truth: mountain (1.00), tree (0.99), sky (0.98), clouds (0.94) Predictions: hillside (1.00), mountain (1.00), valley (0.99), tree (0.99)

Break-down of computational effort. Computation times for the ESP data set (20,000 images) on a single recent 4-core desktop machine: 1h08 low-level feature extraction; 4h44 quantization of low-level features (best of 10 k-means runs); 0h22 cluster assignments; 4h15 pairwise distances; 0h15 neighborhood computation (2,000); 0h05 parameter estimation.

Conclusions and Outlook. We surpass the current state-of-the-art results, both on image annotation and on keyword-based retrieval, on three different data sets and two evaluation protocols. The main contributions of our work: metric learning within the annotation model, and a sigmoidal non-linearity to boost the recall of rare words. Topics of ongoing research: modeling of keyword absences, learning a metric per annotation term, scaling up the learning set to millions of images, and online learning of the model.