Visual Relationship Detection with Language Priors

Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei
Stanford University
* = equal contribution

[Figure: image #1 and image #2 each contain a llama and a person; the relationship is "next to" in one image and "chasing" in the other]

Problem formulation
Input: image only
Output: the detected objects (person, horse) and the relationships between them (riding, in front of)

Related work
- Spatial relationships: cup on top of table
- Action relationships: person kick ball
- Common relationships: person wear shirt
References: Roger et al. ICCV 2008; Galleguillos, CVPR 2008; Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009; Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR 2011

Visual Genome dataset (Krishna et al. IJCV 2016)
- 33K object categories
- 42K relationship categories
- The dataset also contains descriptions, question answers, and attributes

Observation #1
Quadratic explosion of detectors:
- N objects
- K relationships
leading to N^2 K detectors. In the Visual Genome dataset, N = 33K and K = 42K.
Example relationships: ride, next to, lying, drag, falling off, carry, resting on, throw

Observation #2
Long-tail distribution of relationships, which makes supervised training difficult.
[Plot: number of occurrences per relationship. Examples marked on the plot: car on street, dog behind tree, dog ride skateboard, elephant drink milk, dog ride surfboard]

Visual module and Language module (from Input to Output)
- Visual module tackles: quadratic explosion of N^2 K detectors
- Language module tackles: long-tail distribution of relationships

Visual module
Input -> Proposals (Uijlings et al. IJCV 2013) -> Sample -> object detector and relationship detector -> Output
Definitions:
- b1, b2 are object proposals
- o1, o2 ∈ [person, horse, ...]
- r ∈ [on, in, ride, front of, ...]
- T is a <o1, r, o2> triple
Example: the visual module alone outputs "person in horse". With the language module (o1: man, r: ride, o2: horse), the output becomes "person riding horse".
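The flow on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function `rank_triples`, the multiplicative way the scores are combined, the `toy_language_score` lookup, and every number below are assumptions made for the example.

```python
from itertools import product

def rank_triples(probs_1, probs_2, rel_scores, language_score=None):
    """Rank all <o1, r, o2> triples for one pair of proposals (b1, b2).

    probs_1, probs_2: dict of object category -> object-detector probability
    rel_scores:       dict of relationship -> relationship-detector score
    language_score:   optional callable (o1, r, o2) -> float (assumed interface)
    """
    ranked = []
    for (o1, p1), (r, s), (o2, p2) in product(
        probs_1.items(), rel_scores.items(), probs_2.items()
    ):
        score = p1 * s * p2  # built from per-object and per-relationship outputs
        if language_score is not None:
            score *= language_score(o1, r, o2)  # language prior re-weights the triple
        ranked.append(((o1, r, o2), score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical detector outputs for the person/horse example.
b1_probs = {"person": 0.9, "horse": 0.1}
b2_probs = {"person": 0.2, "horse": 0.8}
rel_scores = {"in": 0.5, "ride": 0.4, "on": 0.1}

# Hypothetical language scores; unseen triples get a neutral 0.5.
_table = {("person", "ride", "horse"): 0.9, ("person", "in", "horse"): 0.1}
def toy_language_score(o1, r, o2):
    return _table.get((o1, r, o2), 0.5)

visual_only = rank_triples(b1_probs, b2_probs, rel_scores)
with_language = rank_triples(b1_probs, b2_probs, rel_scores, toy_language_score)
print(visual_only[0][0])    # top triple from visual scores alone
print(with_language[0][0])  # top triple after applying the language score
```

With these toy numbers the visual scores alone favor "person in horse", while the language score moves "person ride horse" to the top, mirroring the example on the slide.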

Visual module
Proposals (Uijlings et al. IJCV 2013) -> Sample -> object detector and relationship detector
Tackles: quadratic explosion. Only requires N + K detectors.
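To make the saving concrete, the two detector counts can be compared directly. The sketch below assumes the rounded figures quoted for Visual Genome (N = 33K, K = 42K); the variable names are illustrative.

```python
N = 33_000  # object categories (the slides' "33K", rounded)
K = 42_000  # relationship categories (the slides' "42K", rounded)

detectors_per_triple = N * N * K  # one detector for every <o1, r, o2> combination
detectors_factored = N + K        # one per object plus one per relationship

print(f"{detectors_per_triple:.3e}")  # 4.574e+13
print(detectors_factored)             # 75000
```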

Language module
Tackles: long-tail distribution. Can predict rare relationships.
Example: o1: man, r: ride, o2: horse

Training the visual module (object detector and relationship detector)
1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly (ranking loss)
Definitions: b1, b2 are object proposals; o1, o2 ∈ [person, horse, ...]; r ∈ [on, in, ride, front of, ...]
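The slide names a ranking loss without giving its form. A minimal sketch of one common choice, a hinge-style ranking loss, is shown below. The margin of 1.0, the function name `ranking_loss`, and the use of the single highest-scoring competitor are assumptions for illustration, not details confirmed by the slides.

```python
import numpy as np

def ranking_loss(scores, gt_index, margin=1.0):
    """Hinge-style ranking loss for one annotated relationship.

    scores:   1-D sequence with one score per candidate triple
    gt_index: position of the annotated (ground-truth) triple in `scores`
    The loss is zero once the ground truth beats every other candidate
    by at least `margin` (an assumed hyperparameter).
    """
    scores = np.asarray(scores, dtype=float)
    gt_score = scores[gt_index]
    best_other = np.delete(scores, gt_index).max()
    return float(max(0.0, margin - gt_score + best_other))

# Ground truth clearly ahead: no penalty.
print(ranking_loss([2.5, 0.3, 1.0], gt_index=0))
# Ground truth ranked below a competitor: positive penalty.
print(ranking_loss([0.5, 0.3, 1.0], gt_index=0))
```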

Training the language module
Example pair: dog ride skateboard, dog ride surfboard
[Equation not recoverable from the extracted text], where cos is the cosine distance.
Minimize: [objective not recoverable from the extracted text]
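The equations on these slides did not survive extraction, so the following is only a sketch of one plausible formulation. It assumes that the distance between two relationships is one plus the sum of cosine distances between their objects and predicates in a word-embedding space, and that the minimized quantity is the variance of squared score differences divided by that distance. The 2-D embeddings, the scores in `f`, and all function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def relationship_distance(R, R2, emb):
    """Assumed form: 1 + cos(r, r') + cos(o1, o1') + cos(o2, o2')."""
    (o1, r, o2), (p1, q, p2) = R, R2
    return (1.0
            + cosine_distance(emb[r], emb[q])
            + cosine_distance(emb[o1], emb[p1])
            + cosine_distance(emb[o2], emb[p2]))

def consistency_objective(relationships, f, emb):
    """Assumed objective: variance of (f(R) - f(R'))^2 / d(R, R') over all pairs."""
    ratios = [(f[R] - f[R2]) ** 2 / relationship_distance(R, R2, emb)
              for R, R2 in combinations(relationships, 2)]
    return float(np.var(ratios))

# Hypothetical 2-D word embeddings (real ones would come from a trained model).
emb = {w: np.array(v) for w, v in {
    "dog": [1.0, 0.0], "car": [-1.0, 0.2],
    "ride": [0.0, 1.0], "on": [1.0, -1.0],
    "skateboard": [1.0, 1.0], "surfboard": [1.0, 0.9], "street": [-1.0, -1.0],
}.items()}

a = ("dog", "ride", "skateboard")
b = ("dog", "ride", "surfboard")
c = ("car", "on", "street")
f = {a: 0.8, b: 0.7, c: 0.1}  # hypothetical language-module scores

near = relationship_distance(a, b, emb)
far = relationship_distance(a, c, emb)
objective = consistency_objective([a, b, c], f, emb)
print(near < far)
```

Under these assumptions, "dog ride skateboard" and "dog ride surfboard" are close, so a well-trained language module would be pushed to give them similar scores even if one of them is rare in the training data.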

Training both modules iteratively: Visual module, Language module

Our results
Relationship types: spatial, comparative, asymmetrical, verb, prepositional
[Figure labels: person, wear, taller than, person, left of, on, wear, shirt, snow, ski]

Ablation study

                        Recall @ 50   Recall @ 100   mAP
Sadeghi et al. 2011         0.07          0.09       0.04
Visual only                 1.58          1.85       0.84
Visual + language          13.86         14.76       1.52

[Figure labels: person wear shirt, person in horse, person ride horse, person in shirt, person near horse]
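For readers unfamiliar with the metric, Recall @ K can be sketched as the fraction of annotated relationships that appear among the K highest-scoring predictions for an image. The version below is simplified: it assumes a prediction is correct when its triple matches exactly and ignores bounding-box overlap, which a full evaluation would also check. The function name and toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found in the top-k predictions.

    predictions:  list of (triple, score) pairs for one image
    ground_truth: set of annotated triples for the same image
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    top_k = {triple for triple, _ in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)

# Hypothetical predictions and annotations for one image.
preds = [
    (("person", "in", "horse"), 0.40),
    (("person", "ride", "horse"), 0.35),
    (("person", "wear", "shirt"), 0.20),
    (("person", "near", "horse"), 0.05),
]
truth = {("person", "ride", "horse"), ("person", "wear", "shirt")}

print(recall_at_k(preds, truth, k=1))
print(recall_at_k(preds, truth, k=3))
```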

[Figure: person ride bicycle]

[Figure: person throw frisbee]

Zero shot detection
- person sit chair: 948 training examples
- hydrant on ground: 29 training examples
- person sit hydrant: 0 training examples

Zero shot detection
- person ride horse: 578 training examples
- person wear hat: 1023 training examples
- horse wear hat: 0 training examples
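The zero-shot setting in these examples can be stated as a simple set operation: a test triple is zero-shot when the triple itself never occurs in training, even though its objects and its relationship do. The sketch below uses the triples from the slides as toy data; the function name `zero_shot_triples` is hypothetical.

```python
def zero_shot_triples(train_triples, test_triples):
    """Return test triples unseen in training whose parts were all seen."""
    seen = set(train_triples)
    objects = {o for o1, _, o2 in seen for o in (o1, o2)}
    relationships = {r for _, r, _ in seen}
    return [
        t for t in test_triples
        if t not in seen
        and t[0] in objects
        and t[1] in relationships
        and t[2] in objects
    ]

train = [
    ("person", "sit", "chair"),
    ("hydrant", "on", "ground"),
    ("person", "ride", "horse"),
    ("person", "wear", "hat"),
]
test = [
    ("person", "sit", "hydrant"),  # unseen combination of seen parts
    ("horse", "wear", "hat"),      # unseen combination of seen parts
    ("person", "sit", "chair"),    # already seen in training
]

print(zero_shot_triples(train, test))
```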

Questions? Poster #4