Visual Relationship Detection with Language Priors Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei Stanford University * = equal contribution
[Figure: image #1 and image #2 each contain a llama and a person; the relationships labeled are "next to" and "chasing".]
Problem formulation
Input: an image only
Output: the objects in the image (person, person, horse, horse) and the relationships between them (riding, riding, in front of)
Related work
- Spatial relationships: cup on top of table
- Action relationships: person kick ball
- Common relationships: person wear shirt
Prior work: Roger et al. ICCV 2008; Galleguillos, CVPR 2008; Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009; Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR 2011
Visual Genome dataset (Krishna et al. IJCV 2016)
- 33K object categories
- 42K relationship categories
- The dataset also contains descriptions, question-answer pairs, and attributes
Observation #1
Quadratic explosion of detectors:
- N objects,
- K relationships
leading to N^2 K detectors, one per <object, relationship, object> combination.
Visual Genome dataset: N = 33K, K = 42K
Example relationships: ride, next to, lying, drag, falling off, carry, resting on, throw
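To make the scale concrete, the count can be checked directly. This is a minimal sketch that treats the rounded slide figures as exactly 33,000 objects and 42,000 relationships, which is an assumption.

```python
# Rounded category counts from the slide (assumed to be exactly 33,000 and 42,000).
N = 33_000  # object categories
K = 42_000  # relationship categories

# One detector per <object, relationship, object> combination.
detectors_per_triple = N * N * K

print(f"{detectors_per_triple:.3e}")  # 4.574e+13
```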
Observation #2
Long tail distribution of relationships
- makes supervised training difficult
[Figure: number of occurrences (y-axis) per relationship (x-axis). Relationships marked along the curve: car on street, dog behind tree, dog ride skateboard, elephant drink milk, dog ride surfboard.]
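The long tail can be illustrated by ranking relationships by frequency. The sketch below uses toy counts that are made up for illustration (they are not Visual Genome statistics), and the threshold for "rare" is likewise an arbitrary assumption.

```python
from collections import Counter

# Toy annotations, made up for illustration only (not Visual Genome counts).
annotations = (
    [("car", "on", "street")] * 50
    + [("dog", "behind", "tree")] * 8
    + [("dog", "ride", "skateboard")] * 3
    + [("elephant", "drink", "milk")] * 1
    + [("dog", "ride", "surfboard")] * 1
)

counts = Counter(annotations)

# Sort relationships from most to least frequent to expose the head and the tail.
ranked = counts.most_common()
total = sum(counts.values())

for triple, n in ranked:
    print(" ".join(triple), n, f"{n / total:.1%}")

# Relationships with very few examples (arbitrary cutoff of 3) form the tail.
rare = [triple for triple, n in ranked if n <= 3]
```

A handful of relationships account for most annotations, while the rest have too few examples to train a dedicated detector.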
Input -> Visual module, Language module -> Output
- Visual module tackles: quadratic explosion of N^2 K detectors
- Language module tackles: long tail distribution of relationships
Visual module
Input -> proposals (Uijlings et al. IJCV 2013) -> sample a pair of proposals -> object detector, relationship detector -> Output
Definitions:
- b1, b2 are object proposals
- o1, o2 ∈ [person, horse, ...]
- r ∈ [on, in, ride, front of, ...]
- T is a <o1, r, o2> triple
Example: the visual module alone outputs "person in horse". With the language module (o1: man, r: ride, o2: horse), the output becomes "person riding horse".
Visual module
Proposals (Uijlings et al. IJCV 2013) -> sample -> object detector, relationship detector
Tackles: quadratic explosion
- only requires N + K detectors
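The saving comes from scoring objects and relationships separately and combining the scores per triple. The following is a minimal sketch, assuming the triple score is the product of the two object confidences and the relationship confidence; the function name `score_triples` and all numbers are hypothetical.

```python
from itertools import product

# Hypothetical detector outputs for one sampled pair of proposals (b1, b2).
# All numbers are made up for illustration.
object_probs_b1 = {"person": 0.9, "horse": 0.1}  # object detector on b1
object_probs_b2 = {"person": 0.2, "horse": 0.8}  # object detector on b2
relationship_scores = {"on": 0.3, "in": 0.4, "ride": 0.35, "front of": 0.1}

def score_triples(p1, p2, rel):
    """Score every <o1, r, o2> triple from N object scores and K relationship scores."""
    return {
        (o1, r, o2): p1[o1] * rel[r] * p2[o2]
        for o1, r, o2 in product(p1, rel, p2)
    }

scores = score_triples(object_probs_b1, object_probs_b2, relationship_scores)
best = max(scores, key=scores.get)
print(best)  # ('person', 'in', 'horse')
```

Here 2 object classes and 4 relationship classes cover all 16 triples, and with these toy numbers the top triple is "person in horse".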
Language module
Tackles: long tail distribution
- can predict rare relationships
Example: o1: man, r: ride, o2: horse
Training the visual module (object detector, relationship detector)
1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly
Loss: ranking loss
Definitions:
- b1, b2 are object proposals
- o1, o2 ∈ [person, horse, ...]
- r ∈ [on, in, ride, front of, ...]
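The slides name a ranking loss without giving its form. As a rough illustration only, a hinge-style ranking loss could look like the sketch below; the margin of 1.0, the function name `ranking_loss`, and the toy scores are all assumptions, not the authors' exact objective.

```python
def ranking_loss(scores, ground_truth, margin=1.0):
    """Hinge-style ranking loss (assumed form): the ground-truth triple
    should outscore the best competing triple by at least `margin`."""
    best_other = max(s for t, s in scores.items() if t != ground_truth)
    return max(0.0, margin - scores[ground_truth] + best_other)

# Toy triple scores, made up for illustration.
scores = {
    ("person", "ride", "horse"): 0.6,
    ("person", "in", "horse"): 0.9,
    ("person", "on", "horse"): 0.2,
}
loss = ranking_loss(scores, ("person", "ride", "horse"))

# A well-separated ground truth incurs no loss.
separated = {("person", "ride", "horse"): 3.0, ("person", "in", "horse"): 1.0}
no_loss = ranking_loss(separated, ("person", "ride", "horse"))
```

The loss is positive whenever a wrong triple scores within the margin of the ground truth, which pushes the correct triple to the top of the ranking.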
Training the language module
Example pair of relationships: dog ride skateboard, dog ride surfboard
[Equations not recovered from the slides: a distance between the two relationships, where cos is the cosine distance, and the objective to minimize.]
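To show how a cosine distance compares two relationships such as "dog ride skateboard" and "dog ride surfboard", here is a minimal sketch. The 3-dimensional word vectors are made up, and summing the component-wise distances over the triple (`triple_distance`) is an assumption for illustration, not the formula from the slides.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy word vectors, made up for illustration; real ones would come from a trained embedding.
vec = {
    "dog": [1.0, 0.0, 0.0],
    "ride": [0.0, 1.0, 0.0],
    "skateboard": [0.0, 0.6, 0.8],
    "surfboard": [0.0, 0.8, 0.6],
}

def triple_distance(t1, t2):
    """Assumed distance between two triples: sum of component-wise cosine distances."""
    return sum(cosine_distance(vec[a], vec[b]) for a, b in zip(t1, t2))

d = triple_distance(("dog", "ride", "skateboard"), ("dog", "ride", "surfboard"))
print(round(d, 4))
```

A small distance means the two relationships are semantically close, so a frequent one can inform the score of a rare one.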
Training both modules iteratively
Visual module <-> Language module
Our results
Relationship types: spatial, comparative, asymmetrical, verb, prepositional
[Figure: example detections, with labels including person, wear, taller than, left of, on, shirt, snow, ski.]
Ablation study
Method                Recall @ 50   Recall @ 100   mAP
Sadeghi et al. 2011   0.07          0.09           0.04
Visual only           1.58          1.85           0.84
Visual + language     13.86         14.76          1.52
Example predictions:
- Sadeghi et al. 2011: person wear shirt, person wear shirt
- Visual only: person in horse, person in shirt
- Visual + language: person ride horse, person near horse
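For readers unfamiliar with the metric, Recall @ k measures how many ground-truth relationships appear among the k highest-scoring predictions. The sketch below is a simplified version that matches triples by label only and ignores box localization, which is an assumption; the function name and toy data are hypothetical.

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the top-k predictions.

    Simplification: triples are matched by label only, not by bounding box.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth)

# Toy predictions sorted by descending score, made up for illustration.
predictions = [
    ("person", "ride", "horse"),
    ("person", "in", "horse"),
    ("person", "near", "horse"),
    ("person", "wear", "shirt"),
]
ground_truth = [("person", "ride", "horse"), ("person", "wear", "shirt")]

r2 = recall_at_k(predictions, ground_truth, k=2)
r4 = recall_at_k(predictions, ground_truth, k=4)
```

Recall is used rather than precision because annotations are incomplete: a correct prediction that nobody labeled should not be penalized.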
[Figure: example detection: person ride bicycle]
[Figure: example detection: person throw frisbee]
Zero shot detection
Example 1:
- person sit chair: 948 training examples
- hydrant on ground: 29 training examples
- person sit hydrant: 0 training examples
Example 2:
- person ride horse: 578 training examples
- person wear hat: 1023 training examples
- horse wear hat: 0 training examples
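The zero-shot cases above share a pattern: the triple never occurs in training, but its objects and relationship each occur in other triples. A minimal sketch of that check, using the training counts from the slides; the helper `is_zero_shot` is hypothetical.

```python
# Training counts taken from the slides.
train_counts = {
    ("person", "sit", "chair"): 948,
    ("hydrant", "on", "ground"): 29,
    ("person", "ride", "horse"): 578,
    ("person", "wear", "hat"): 1023,
}

def is_zero_shot(triple, counts):
    """True if the triple is unseen in training while its parts are seen elsewhere."""
    if counts.get(triple, 0) > 0:
        return False
    seen_objects = {o for (s, _, o2) in counts for o in (s, o2)}
    seen_relationships = {r for (_, r, _) in counts}
    o1, r, o2 = triple
    return o1 in seen_objects and o2 in seen_objects and r in seen_relationships

print(is_zero_shot(("person", "sit", "hydrant"), train_counts))  # True
print(is_zero_shot(("horse", "wear", "hat"), train_counts))      # True
print(is_zero_shot(("person", "ride", "horse"), train_counts))   # False
```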
Poster #4
Questions?