Visual Relationship Detection with Language Priors

Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei
Stanford University
* = equal contribution

[Figure: image #1 and image #2 each contain a llama and a person. The relationship labels shown are "next to" and "chasing".]

Problem formulation
- Input: an image only
- Output: the objects in the image (person, horse) and the relationships between them (riding, in front of)

Related work
- Spatial relationships: cup on top of table
- Action relationships: person kick ball
- Common relationships: person wear shirt
References: Roger et al. ICCV 2008; Galleguillos, CVPR 2008; Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009; Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR 2011

Visual Genome dataset (Krishna et al. IJCV 2016)
- 33K object categories
- 42K relationship categories
- The dataset also contains descriptions, question answers and attributes

Observation #1: quadratic explosion
- N objects and K relationships lead to $N^2 K$ detectors
- Visual Genome dataset: N = 33K, K = 42K
- Example relationships: ride, next to, lying, drag, falling off, carry, resting on, throw
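To make the scale concrete, the following sketch plugs the Visual Genome numbers from this slide into the $N^2 K$ count and compares it with the N + K count quoted later for the visual module. It assumes one detector per <object, relationship, object> triple in the first case and one detector per object category plus one per relationship in the second; the function names are illustrative, not taken from the talk.

```python
def triple_detectors(n_objects: int, n_relationships: int) -> int:
    """Detectors needed if every <o1, r, o2> triple gets its own detector."""
    return n_objects ** 2 * n_relationships


def factored_detectors(n_objects: int, n_relationships: int) -> int:
    """Detectors needed if objects and relationships are detected separately."""
    return n_objects + n_relationships


N = 33_000  # object categories in Visual Genome, as quoted on the slide
K = 42_000  # relationship categories in Visual Genome, as quoted on the slide

print(triple_detectors(N, K))    # 45738000000000
print(factored_detectors(N, K))  # 75000
```

With these numbers the per-triple approach would need roughly 4.6 x 10^13 detectors, against 75,000 for the factored approach.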

Observation #2: long-tail distribution of relationships
- [Plot: number of occurrences per relationship]
- Examples marked on the plot: car on street, dog behind tree, dog ride skateboard, elephant drink milk, dog ride surfboard
- The long tail makes supervised training difficult

Model overview (input, visual module and language module, output)
- Visual module tackles the quadratic explosion of $N^2 K$ detectors
- Language module tackles the long-tail distribution of relationships

Visual module
- Input image, then object proposals (Uijlings et al. IJCV 2013), then an object detector and a relationship detector, then output
- Definitions:
  - b1, b2 are object proposals
  - o1, o2 ∈ [person, horse, ...]
  - r ∈ [on, in, ride, front of, ...]
  - T is a <o1, r, o2> triple
- Sample: the visual module alone outputs "person in horse"; once the language module is added (o1: man, r: ride, o2: horse), the output is "person riding horse"
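The way an object detector and a relationship detector can combine into a score for a triple T = <o1, r, o2> might be sketched as follows. This is a minimal illustration, assuming the triple score is simply the product of the two object scores and the relationship score; the slides do not give the exact scoring function, and all probabilities below are made-up toy values.

```python
from itertools import product


def score_triples(obj1_probs, rel_probs, obj2_probs):
    """Score every <o1, r, o2> triple and return (score, triple) pairs, best first.

    Assumption: score = P(o1 | b1) * P(r | b1, b2) * P(o2 | b2).
    """
    scored = []
    for (o1, p1), (r, pr), (o2, p2) in product(
        obj1_probs.items(), rel_probs.items(), obj2_probs.items()
    ):
        scored.append((p1 * pr * p2, (o1, r, o2)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical detector outputs for proposals b1 and b2.
obj1 = {"person": 0.9, "horse": 0.1}
obj2 = {"person": 0.2, "horse": 0.8}
rel = {"in": 0.5, "ride": 0.4, "on": 0.1}

best_score, best_triple = score_triples(obj1, rel, obj2)[0]
print(best_triple)  # ('person', 'in', 'horse')
```

With these toy values the visual scores alone prefer "person in horse", which is the kind of error the sample on the slide shows before the language module is applied.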

Visual module (proposals: Uijlings et al. IJCV 2013; object detector and relationship detector)
- Tackles the quadratic explosion: only requires N + K detectors

Language module
- Tackles the long-tail distribution: can predict rare relationships
- Example: o1: man, r: ride, o2: horse

Training the visual module
1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly
Loss: ranking loss
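The slide names a ranking loss but does not spell out its form. As a rough illustration only, one common margin-based (hinge) ranking loss penalizes the model whenever the ground-truth triple does not outscore the best competing triple by a margin. The function name, the margin of 1.0, and the scores below are assumptions, not values from the talk.

```python
def hinge_ranking_loss(true_score, other_scores, margin=1.0):
    """Margin-based ranking loss for one ground-truth triple.

    The loss is zero when the ground-truth score exceeds the highest
    competing score by at least `margin`, and grows linearly otherwise.
    """
    hardest_negative = max(other_scores)
    return max(0.0, margin - true_score + hardest_negative)


# Ground truth clearly ahead of every competitor: no penalty.
print(hinge_ranking_loss(2.0, [0.5, 0.2]))

# Ground truth scored below a competitor: positive penalty.
print(hinge_ranking_loss(0.3, [0.5, 0.2]))
```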

Training the language module
- Example pair: dog ride skateboard, dog ride surfboard
- [Equation relating the two relationships] where cos is the cosine distance
- Minimize: [objective]
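To illustrate why "dog ride skateboard" and "dog ride surfboard" are useful as a pair, the sketch below measures how far apart two relationships are in a word-vector space. It assumes cosine distance means 1 minus cosine similarity and that the distance between two triples is the sum of the distances between their corresponding words; both are assumptions, and the 4-dimensional vectors are hand-made toy values rather than real word embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triple_distance(t1, t2, vectors):
    """Sum of word-wise cosine distances between two <o1, r, o2> triples."""
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in zip(t1, t2))


# Toy word vectors (hypothetical, chosen so that similar words point the same way).
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "ride": [0.0, 1.0, 0.0, 0.0],
    "skateboard": [0.0, 0.0, 1.0, 0.0],
    "surfboard": [0.0, 0.0, 0.8, 0.6],
    "car": [0.0, 0.0, 0.0, 1.0],
    "on": [0.6, 0.8, 0.0, 0.0],
    "street": [0.0, 1.0, 0.0, 0.0],
}

near = triple_distance(
    ("dog", "ride", "skateboard"), ("dog", "ride", "surfboard"), TOY_VECTORS
)
far = triple_distance(
    ("dog", "ride", "skateboard"), ("car", "on", "street"), TOY_VECTORS
)
print(near < far)  # True
```

Under these assumptions, a rare relationship that lies close to a frequent one can borrow evidence from it, which is the intuition behind using language to handle the long tail.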

Training both modules iteratively
[Diagram: visual module and language module]

Our results
- Relationship types: spatial, comparative, asymmetrical, verb, prepositional
- [Figure: example detections. Object labels: person, shirt, snow, ski. Relationship labels: wear, taller than, left of, on.]

Ablation study

Model                  Recall @ 50   Recall @ 100   mAP
Sadeghi et al. 2011    0.07          0.09           0.04
Visual only            1.58          1.85           0.84
Visual + language      13.86         14.76          1.52

Example predictions shown with each row:
- Sadeghi et al. 2011: person wear shirt
- Visual only: person in horse, person in shirt
- Visual + language: person ride horse, person near horse
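For readers unfamiliar with the Recall @ 50 and Recall @ 100 columns, the sketch below shows the basic idea: the fraction of ground-truth relationships that appear among the top K ranked predictions for an image. It is a simplified illustration that matches triples by label only and ignores bounding-box overlap, which a full evaluation would presumably also check; the function name and example data are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triples found in the top-k predictions.

    `ranked_predictions` is a list of triples ordered from most to least
    confident; `ground_truth` is a non-empty collection of triples.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)


predictions = [
    ("person", "ride", "horse"),
    ("person", "in", "shirt"),
    ("person", "wear", "shirt"),
]
truth = [("person", "ride", "horse"), ("person", "wear", "shirt")]

print(recall_at_k(predictions, truth, k=2))  # 0.5
print(recall_at_k(predictions, truth, k=3))  # 1.0
```

Recall is a natural choice when annotations are incomplete, because a correct prediction that annotators did not label is not counted as an error.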

[Figure: example detections: person ride bicycle; person throw frisbee]

Zero shot detection

Example 1:
- person sit chair: 948 training examples
- hydrant on ground: 29 training examples
- person sit hydrant: 0 training examples

Example 2:
- person ride horse: 578 training examples
- person wear hat: 1023 training examples
- horse wear hat: 0 training examples
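The zero-shot setting in these examples can be sketched as a simple filter: a test triple counts as zero-shot when the full triple never occurs in training, even though its objects and relationship each occur in other training triples. The helper below is a hypothetical illustration of that selection, not the evaluation code behind the talk.

```python
def zero_shot_triples(train_triples, test_triples):
    """Return test triples unseen in training whose parts were all seen."""
    seen_triples = set(train_triples)
    seen_objects = {o for o1, _, o2 in train_triples for o in (o1, o2)}
    seen_relationships = {r for _, r, _ in train_triples}
    return [
        (o1, r, o2)
        for (o1, r, o2) in test_triples
        if (o1, r, o2) not in seen_triples
        and o1 in seen_objects
        and o2 in seen_objects
        and r in seen_relationships
    ]


train = [
    ("person", "sit", "chair"),
    ("hydrant", "on", "ground"),
    ("person", "ride", "horse"),
    ("person", "wear", "hat"),
]
test = [
    ("person", "sit", "chair"),
    ("person", "sit", "hydrant"),
    ("horse", "wear", "hat"),
]

print(zero_shot_triples(train, test))
# [('person', 'sit', 'hydrant'), ('horse', 'wear', 'hat')]
```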

Questions? Poster #4