Visual Relationship Detection with Language Priors

1 Visual Relationship Detection with Language Priors. Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei. Stanford University. (* = equal contribution)

2 [Figure: image #1 and image #2, each labeled with the objects "llama" and "person"]

3 [Figure labels: "next to", "chasing"]

6 Problem formulation. Input: image only.

7 Problem formulation. Input: image only. Output: <person, riding, horse>, <person, riding, horse>, <horse, in front of, horse>.

8 Problem formulation. Input: image only. Output (objects): person, person, horse, horse.

9 Problem formulation. Input: image only. Output (objects and relationships): <person, riding, horse>, <person, riding, horse>.

10 Problem formulation. Input: image only. Output: <person, riding, horse>, <person, riding, horse>, <horse, in front of, horse>.
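
The output described on these slides can be pictured as a list of triples, each tied to two boxes in the image. The following is a minimal sketch of such a structure; the class name `Relationship`, the field names, and all box coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Relationship:
    """One detected relationship: two localized objects and a predicate."""
    subject: str
    predicate: str
    obj: str
    subject_box: Box
    object_box: Box

    def as_triple(self) -> Tuple[str, str, str]:
        return (self.subject, self.predicate, self.obj)


def example_output() -> List[Relationship]:
    # Box coordinates are made up; only the labels come from the slides.
    person_a, person_b = (40, 10, 90, 120), (200, 15, 250, 125)
    horse_a, horse_b = (20, 60, 130, 200), (180, 65, 290, 205)
    return [
        Relationship("person", "riding", "horse", person_a, horse_a),
        Relationship("person", "riding", "horse", person_b, horse_b),
        Relationship("horse", "in front of", "horse", horse_a, horse_b),
    ]


for rel in example_output():
    print(rel.as_triple())
```

Keeping the boxes next to the labels is what separates this task from captioning: two identical triples are still distinct detections when they refer to different regions.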

11 Related work
- Spatial relationships: cup on top of table
- Action relationships: person kick ball
- Common relationships: person wear shirt
Cited: Roger et al. ICCV 2008; Galleguillos, CVPR 2008; Yao et al. CVPR 2012; Maji et al. CVPR 2011; Rohrbach et al. ICCV 2013; Gupta et al. PAMI 2009; Gupta et al. ECCV 2008; Kumar et al. CVPR 2010; Wang et al. ECCV 2016; Sadeghi et al. CVPR

13 Visual Genome dataset: 33K object categories, 42K relationship categories. The dataset also contains descriptions, question answers, and attributes. Krishna et al. IJCV

14 Observation #1: Quadratic explosion of
- N objects,
- K relationships
leading to N²K detectors. Visual Genome dataset: N = 33K, K = 42K. Example relationships shown: ride, next to, lying, drag, falling off, carry, resting on, throw.

15 Observation #2: Long tail distribution of relationships, which makes supervised training difficult. [Plot: # of occurrences (y-axis) against relationships (x-axis)]

16 Observation #2 (same plot). Examples marked on the curve: car on street, dog behind tree, dog ride skateboard.

17 Observation #2 (same plot). Examples marked on the curve: car on street, dog ride skateboard, elephant drink milk, dog ride surfboard.

18 Model overview: Input, Visual module, Language module, Output. The visual module tackles the quadratic explosion of N²K detectors. The language module tackles the long tail distribution of relationships.

19-23 Visual module (built up step by step). Stages shown: Input, Proposals (Uijlings et al. IJCV 2013), Sample, object detector, relationship detector, Output.
Definitions:
- b1, b2 are object proposals
- o1, o2 ∈ {person, horse, ...}
- r ∈ {on, in, ride, front of, ...}
- T is a <o1, r, o2> triple

24 Visual module, example: the sampled pair of proposals is labeled "person in horse".

25-26 Visual module and language module, same example: the language module is shown with o1: man, r: ride, o2: horse, while the visual module still outputs "person in horse".

27 Visual module and language module, same example: the output becomes "person riding horse".
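
The slides do not state how the object detector and relationship detector outputs are combined into a triple T. One simple possibility, assumed here purely for illustration, is to multiply the three likelihoods and keep the highest-scoring triple. The function name `best_triple` and every probability below are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def best_triple(
    p_o1: Dict[str, float],
    p_o2: Dict[str, float],
    p_r: Dict[str, float],
) -> Tuple[Triple, float]:
    """Score every <o1, r, o2> as P(o1) * P(r) * P(o2) and return the best.

    Assumption: a plain product of detector outputs. This is a sketch,
    not necessarily the scoring used in the talk.
    """
    scored = {
        (o1, r, o2): p_o1[o1] * p_r[r] * p_o2[o2]
        for o1, r, o2 in product(p_o1, p_r, p_o2)
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# Made-up detector outputs for one sampled pair of proposals (b1, b2).
p_o1 = {"person": 0.9, "horse": 0.1}
p_o2 = {"person": 0.2, "horse": 0.8}
p_r = {"on": 0.2, "in": 0.35, "ride": 0.3, "front of": 0.15}

triple, score = best_triple(p_o1, p_o2, p_r)
print(triple, round(score, 3))
```

With these made-up numbers the visual scores alone prefer "in" over "ride", which is the kind of mistake the slides show being corrected once the language module is added.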

28 Visual module. Proposals: Uijlings et al. IJCV 2013. Sample: object detector, relationship detector. Tackles: quadratic explosion; only requires N+K detectors.
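
To make the gap between N²K and N+K concrete, the short calculation below plugs in the Visual Genome figures quoted earlier in the talk (N = 33K, K = 42K). The function name `detector_counts` is hypothetical.

```python
from typing import Tuple


def detector_counts(n_objects: int, n_relationships: int) -> Tuple[int, int]:
    """Return (one detector per triple, separate object and relationship detectors)."""
    per_triple = n_objects ** 2 * n_relationships  # N^2 * K
    factored = n_objects + n_relationships         # N + K
    return per_triple, factored


N, K = 33_000, 42_000  # figures quoted for Visual Genome on the slides
per_triple, factored = detector_counts(N, K)
print(f"N^2 K = {per_triple:.3e}, N + K = {factored}")
```

The per-triple count is on the order of 10^13, against 75,000 detectors when objects and relationships are detected separately.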

29 Language module. Example input: o1: man, r: ride, o2: horse. Tackles: long tail distribution; can predict rare relationships.

30-33 Training the visual module (built up step by step):
1. Pre-train using ImageNet (Deng et al. 2009)
2. Train object detector (Girshick et al. CVPR 2014)
3. Train relationship detector
4. Fine-tune both jointly (ranking loss)
Definitions: b1, b2 are object proposals; o1, o2 ∈ {person, horse, ...}; r ∈ {on, in, ride, front of, ...}
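
The slide names a ranking loss without giving its form. A common choice, assumed here for illustration only, is a margin-based hinge: the ground-truth triple should outscore the best competing triple by at least a margin. The function name `ranking_loss` and the scores are hypothetical.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def ranking_loss(scores: Dict[Triple, float], truth: Triple, margin: float = 1.0) -> float:
    """Hinge ranking loss: max(0, margin - score(truth) + best competing score).

    Assumption: a generic margin-based form, not necessarily the exact
    objective used in the talk.
    """
    competitors = [s for t, s in scores.items() if t != truth]
    if not competitors:
        return 0.0  # nothing to rank against
    return max(0.0, margin - scores[truth] + max(competitors))


# Made-up scores for one pair of proposals.
scores = {
    ("person", "ride", "horse"): 2.0,
    ("person", "in", "horse"): 1.5,
    ("person", "near", "horse"): 0.3,
}
print(ranking_loss(scores, ("person", "ride", "horse")))
```

The loss is zero once the correct triple leads by the full margin, so training effort goes to the pairs that are still ranked wrongly or too closely.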

34-37 Training the language module. Running example: "dog ride skateboard" and "dog ride surfboard". [Equation image not preserved], where cos is the cosine distance. Minimize: [objective image not preserved].
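
The only detail the slides state about the language module's formulas is that cos is the cosine distance. As a rough sketch of how such a distance could compare two relationships, the code below sums the cosine distances between corresponding words of two triples. Summing over the three slots is an assumption, and the three-dimensional word vectors are made up; real word embeddings would be much larger.

```python
import math
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def cosine_distance(u: List[float], v: List[float]) -> float:
    """1 - cosine similarity. Both vectors must be nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def relationship_distance(t1: Triple, t2: Triple, vectors: Dict[str, List[float]]) -> float:
    """Assumed form: sum of cosine distances over subject, predicate, object."""
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in zip(t1, t2))


# Made-up toy embeddings, for illustration only.
vectors = {
    "dog": [1.0, 0.0, 0.0],
    "ride": [0.0, 1.0, 0.0],
    "skateboard": [0.0, 0.6, 0.8],
    "surfboard": [0.0, 0.8, 0.6],
    "elephant": [0.0, 0.0, 1.0],
    "drink": [1.0, 0.0, 0.0],
    "milk": [1.0, 1.0, 0.0],
}

near = relationship_distance(("dog", "ride", "skateboard"), ("dog", "ride", "surfboard"), vectors)
far = relationship_distance(("dog", "ride", "skateboard"), ("elephant", "drink", "milk"), vectors)
print(near, far)
```

Under this sketch, "dog ride skateboard" and "dog ride surfboard" differ only in their last word, so they end up close together, while an unrelated triple such as "elephant drink milk" is far away.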

38 Training both modules iteratively Visual module Language module

39 Our results

40-45 Our results. Relationship types: spatial, comparative, asymmetrical, verb, prepositional. [Figure: example detections with the labels person, wear, taller than, person, left of, on, wear, shirt, snow, ski]

46-49 Ablation study (mAP), comparing Sadeghi et al., Visual only, and Visual + language. Example predictions added over the builds: person wear shirt, person wear shirt, person in horse, person in shirt, person ride horse, person near horse.

50 [Example detection: person ride bicycle]

51 [Example detection: person throw frisbee]

52 Zero-shot detection. person sit chair: 948 training examples. hydrant on ground: 29 training examples.

53 Zero-shot detection. person sit chair: 948 training examples. hydrant on ground: 29 training examples. person sit hydrant: 0 training examples.

54 Zero-shot detection. person ride horse: 578 training examples. person wear hat: 1023 training examples.

55 Zero-shot detection. person ride horse: 578 training examples. person wear hat: 1023 training examples. horse wear hat: 0 training examples.
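
The zero-shot cases on these slides share a pattern: the full triple never appears in training, but each of its parts does. The sketch below checks for that pattern using the training counts quoted on the slides. The function name `is_zero_shot` and this exact definition are assumptions made for illustration.

```python
from collections import Counter
from typing import Tuple

Triple = Tuple[str, str, str]

# Training counts as quoted on the slides.
train_counts = Counter({
    ("person", "sit", "chair"): 948,
    ("hydrant", "on", "ground"): 29,
    ("person", "ride", "horse"): 578,
    ("person", "wear", "hat"): 1023,
})


def is_zero_shot(triple: Triple, counts: Counter) -> bool:
    """True if the triple is unseen but its objects and predicate are all seen.

    Assumption: an object counts as seen if it occurs in either slot of
    any training triple.
    """
    if counts[triple] > 0:
        return False
    seen = [t for t, c in counts.items() if c > 0]
    seen_objects = {t[0] for t in seen} | {t[2] for t in seen}
    seen_predicates = {t[1] for t in seen}
    subj, pred, obj = triple
    return subj in seen_objects and obj in seen_objects and pred in seen_predicates


for candidate in [("person", "sit", "hydrant"), ("horse", "wear", "hat"), ("person", "ride", "horse")]:
    print(candidate, is_zero_shot(candidate, train_counts))
```

Both "person sit hydrant" and "horse wear hat" pass this check, while "person ride horse" does not because it has 578 training examples.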

56 Visual Relationship Detection with Language Priors. Cewu Lu*, Ranjay Krishna*, Michael Bernstein, Li Fei-Fei. Stanford University. Poster #4. Questions?
