CNN Based Object Detection in Large Video Images. WangTao, wtao@qiyi.com, IQIYI ltd. 2016.4

1 CNN Based Object Detection in Large Video Images WangTao, IQIYI ltd

2 Outline
- Introduction: Background, Challenge
- Our approach: System framework, Object detection, Scene recognition, Body segmentation, Same style matching
- Experiments
- Conclusion

3 Background
- Video out applications
- Image retrieval
- Video advertising

4 Challenge
Real video data vs. image dataset
- Cluttered background
- Multiple objects
- Small objects
- Varying pose/position
- Partial occlusion

5 Our task
Problems:
- Content-based object retrieval in large video images
- High accuracy for same style matching
- High speed in a large video database
Solution:
- Accurate object detection + scene classification
- Discriminative DNN features and PCA/LDA transformation
- Speed-up by parallel indexing and hierarchical filtering


6 System framework
- Indexing: Video key frame, Scene classification, Object detection, Body segmentation, CNN feature, Database indexing
- Query: Query image, Scene classification, Faster-RCNN rect, Body segmentation, CNN feature, Match query, Distance sort, Result


7 Object detection (I)
Object detection by Faster R-CNN
- Faster R-CNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS2015]
- Trained on the MS COCO dataset (300k images) + video images (10k images)
- More widely applicable to images with multiple objects

8 Multi-class object detection, including:
- Clothes (skirt, jacket, trousers)
- Bags (handbag, backpack, draw-bar box)
- Electronics (mobile, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
- Glasses, necklace, hat
- Shoes

9 Object detection (II)
Object detection by CNN regression
- Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR2014]
- Efficient for single-object images that Faster R-CNN fails to recognize

10 Body Segmentation
- Constrained by human body parts
- CNN based body segmentation [Jonathan Long, CVPR2015]
- Bounding box, body mask, body parsing
Figure: original image, segmentation image


11 Scene classification
CNN based scene classification [Bolei Zhou, NIPS2014]
- Pipeline: Video key frame, Is scene? (yes/no), CNN based scene classification, Multi-frame fusion, tags
- Scene classification: Precision 65.8%, Recall 74%
- Non scene images
- Scene images of kitchen, office, living room, and bedroom: Precision 83.8%, Recall 56.7%
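The slide names multi-frame fusion without defining it. A minimal sketch, assuming fusion means averaging per-frame class scores over the key frames of a shot and keeping the top label; the function name and the score dictionaries are hypothetical, not taken from the talk.

```python
from collections import defaultdict


def fuse_frame_scores(frame_scores):
    """Average per-frame class scores; return (best_label, fused_scores).

    frame_scores: list of dicts mapping scene label -> score for one key frame.
    Averaging is an assumption; the slides do not state the fusion rule.
    """
    if not frame_scores:
        raise ValueError("need at least one frame")
    totals = defaultdict(float)
    for scores in frame_scores:
        for label, score in scores.items():
            totals[label] += score
    n = len(frame_scores)
    fused = {label: total / n for label, total in totals.items()}
    best = max(fused, key=fused.get)
    return best, fused


# Three key frames from the same shot (made-up scores).
frames = [
    {"kitchen": 0.6, "office": 0.4},
    {"kitchen": 0.3, "office": 0.7},
    {"kitchen": 0.8, "office": 0.2},
]
label, fused = fuse_frame_scores(frames)
```

A single noisy frame (the second one here) is outvoted by the other two, which is the usual motivation for fusing across frames.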

12 Scene classes
0 kitchen, 1 dining, 2 bakery, 3 ice_cream_parlor, 4 bathroom, 5 washing_room, 6 bedroom, 7 living_room, 8 office, 9 children_room, 10 nursery, 11 toyshop, 12 shoe_shop, 13 jewelry_shop, 14 outdoor_ice_world, 15 indoor_ice_skating_rink, 16 baseball, 17 football, 18 basketball_court, 19 swimming_pool, 20 track, 21 bowling_alley, 22 billiards, 23 tennis, 24 volleyball, 25 gymnasium, 26 pleasure_ground, 27 hospital_room, 28 dentists, 29 drugstore, 30 music_studio, 31 music_store, 32 sandbeach, 33 hairsalon, 34 bar, 35 pagoda, 36 bamboo_forest, 37 mountain, 38 coast, 39 creek, 40 waterfall, 41 grass, 42 other


13 Same style matching
- SIFT feature matching
  Normalization of SIFT
  Dimension: 128 dim x 400 pts
  MAP: 22%
- CNN feature of ImageNet 1k classifier
  Model: VGG19
  Layers: fc7
  Dimension: 4096 600
  MAP: 28%
- CNN feature of same style classifier
  Model: VGG19
  Layers: fc7
  Dimension:
  MAP: 34%
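The features on this slide are compared by MAP (mean average precision). A minimal sketch of one common definition follows: average precision per query over a ranked result list, then the mean over queries. The slides do not say which MAP variant was used, so the normalization by the number of relevant items retrieved is an assumption, and the function names are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP for one query.

    ranked_relevance: list of booleans, True where the result at that rank
    is a correct same-style match.
    num_relevant: total relevant items for the query; defaults to the number
    found in the ranking (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    total = num_relevant if num_relevant is not None else hits
    return precision_sum / total if total else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of AP over all queries."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)


q1 = [True, False, True, False]    # hits at ranks 1 and 3
q2 = [False, True, False, False]   # one hit at rank 2
map_score = mean_average_precision([q1, q2])
```

For `q1` the precisions at the two hits are 1/1 and 2/3, so AP is 5/6; for `q2` AP is 1/2; the mean over both queries is 2/3.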

14 Multi-feature fusion
- Same class matching: classifier on ImageNet 21k classes of 15M images
- Same style matching: classifier trained on 1239 queries of 1M images

CNN model                     Feature dim   MAP
Inception_bn1k                              %
Inception_21k                 1024          34%
Vgg19_caffe                   4096          34%
Inception_21k + vgg19_caffe   5120          43%

Speed
- Nvidia K40 GPU, 10x faster than CPU i7
- Faster R-CNN speed: 200 ms/frame, image size 1920x1080
- VGG19 feature speed: 60 ms/frame, image size 256x256
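The fused feature dimension (5120 = 1024 + 4096) suggests that the Inception_21k and vgg19_caffe features are concatenated. A minimal sketch under that assumption; the L2 normalization of each part before concatenation is an added assumption so that neither network dominates the distance, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(inception_feat, vgg_feat):
    """Concatenate the two L2-normalized feature vectors."""
    return l2_normalize(inception_feat) + l2_normalize(vgg_feat)


# Placeholder vectors with the dimensions listed on the slide.
inception_feat = [1.0] * 1024
vgg_feat = [0.5] * 4096
fused = fuse_features(inception_feat, vgg_feat)
```

The fused vector has 5120 entries, and each half contributes unit norm regardless of the raw activation scale of its network.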

15 Experiments
MAP on 3M testing images, trained on 1M images
- Settings listed: VGG19 model, full image, object rectangle, PCA+LDA, Inception-21k
- MAP: 27.8%, 34.2%, 37.3%, 43.1%, 46.1%
Speed up
- Parallel FLANN tree indexing
- Hierarchical filtering by object classes, 10x faster
- Query speed: 1 s/image on 5000 teleplays with 2M images
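Hierarchical filtering by object classes can be pictured as a two-stage lookup: keep only database entries whose detected class matches the query, then sort the survivors by feature distance. The sketch below assumes that reading and uses a brute-force Euclidean sort in place of the FLANN tree index named on the slide; all names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def build_class_index(database):
    """Group (item_id, class_label, feature) entries by detected object class."""
    index = defaultdict(list)
    for item_id, label, feat in database:
        index[label].append((item_id, feat))
    return index


def query(index, label, feat, top_k=5):
    """Stage 1: restrict to the query's class. Stage 2: sort by distance."""
    candidates = index.get(label, [])
    ranked = sorted(candidates, key=lambda c: math.dist(c[1], feat))
    return [item_id for item_id, _ in ranked[:top_k]]


# Toy database with 2-D features (real features would be CNN vectors).
database = [
    ("a", "bag", [0.0, 0.0]),
    ("b", "bag", [1.0, 1.0]),
    ("c", "shoes", [0.1, 0.1]),
    ("d", "bag", [0.2, 0.1]),
]
index = build_class_index(database)
result = query(index, "bag", [0.1, 0.1], top_k=2)
```

Item `c` is the exact nearest neighbour in feature space but is never compared, because the class filter removes it first; this is where the reduction in distance computations comes from.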

16 Query system GUI

17 Query examples on image dataset

19 Query examples on video dataset

21 Conclusion
- The bounding box is important for recognizing the object
- Fusing same style matching features with same class matching features gives higher accuracy
- PCA and LDA further improve accuracy and speed
- GPU is faster for CNN feature extraction
- Queries are sped up by parallel indexing and hierarchical filtering

22 References
- Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems, 2015.
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems.
- Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR, 2015 (arXiv).
- Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV.
- Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR, 2015.
- Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS, 2014.
- Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR, 2015.
- Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV, 2015.
- Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv, 2015.
