CNN Based Object Detection in Large Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4
Outline
- Introduction
- Background
- Challenge
- Our approach
- System framework
- Object detection
- Scene recognition
- Body segmentation
- Same style matching
- Experiments
- Conclusion
Background: Video Out applications
- Image retrieval
- Video advertising
Challenge: real video data vs. image datasets
- Cluttered background
- Multiple objects
- Small objects
- Varying pose/position
- Partial occlusion
Our task
Problems:
- Content-based object retrieval in large video images
- High accuracy for same-style matching
- High speed on a large video database
Solution:
- Accurate object detection + scene classification
- Discriminative DNN features and PCA/LDA transformation
- Speed-up by parallel indexing and hierarchical filtering
System framework
- Indexing: video key frame -> scene classification -> object detection -> body segmentation -> CNN feature -> database indexing
- Query: query image -> scene classification -> Faster R-CNN rect -> body segmentation -> CNN feature -> match query -> distance sort -> result
Object detection (I)
- Object detection by Faster R-CNN
- Faster R-CNN: region proposals + object scores [Ren et al., NIPS 2015]
- Trained on the MS COCO dataset (300k images) + video images (10k images)
- More general: applies to images containing multiple objects
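As a rough illustration of how per-object scores might be consumed downstream, the sketch below keeps detections above a score threshold and clips their rectangles to the frame. The `Detection` tuple, the function name, and the `min_score` value of 0.5 are assumptions made for this example; none of them come from the Faster R-CNN code or from the system described on these slides.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical container for one detector output (not a real API)."""
    label: str
    score: float
    x1: float
    y1: float
    x2: float
    y2: float


def filter_detections(dets: List[Detection], width: int, height: int,
                      min_score: float = 0.5) -> List[Detection]:
    """Keep confident detections and clip their boxes to the frame."""
    kept = []
    for d in dets:
        if d.score < min_score:
            continue  # assumed threshold; tune on validation data
        x1, y1 = max(0.0, d.x1), max(0.0, d.y1)
        x2, y2 = min(float(width), d.x2), min(float(height), d.y2)
        if x2 <= x1 or y2 <= y1:
            continue  # box lies entirely outside the frame
        kept.append(Detection(d.label, d.score, x1, y1, x2, y2))
    # Highest score first, so later stages see the best boxes first.
    return sorted(kept, key=lambda d: d.score, reverse=True)


if __name__ == "__main__":
    frame_dets = [
        Detection("handbag", 0.92, 100, 200, 400, 600),
        Detection("hat", 0.30, 10, 10, 50, 50),
        Detection("laptop", 0.75, 1800, 900, 2000, 1200),
    ]
    for det in filter_detections(frame_dets, width=1920, height=1080):
        print(det)
```

The clipped rectangles could then be used to crop the object region before feature extraction.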
Multi-class object detection, including:
- Clothes (skirt, jacket, trousers)
- Bags (handbag, backpack, trolley case)
- Electronics (mobile phone, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
- Glasses, necklace, hat
- Shoes
Object detection (II)
- Object detection by CNN regression
- Input: an image; output: the coordinates of the object rectangle [Erhan et al., CVPR 2014]
- Efficient for single-object images in which Faster R-CNN does not recognize the object
Body segmentation
- Constrained by human body parts
- CNN-based body segmentation [Long et al., CVPR 2015]
- Outputs: bounding box, body mask, body parsing
- Figure: original image and segmentation image
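One way to relate the body mask to the bounding box is to take the tightest rectangle around the mask's nonzero pixels. The sketch below assumes the mask is a plain list of rows with 0/1 values; the function name and the mask representation are illustrative assumptions, not the format used by the segmentation network.

```python
from typing import List, Optional, Tuple


def mask_to_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max), inclusive, of nonzero pixels.

    Returns None when the mask contains no foreground pixel.
    """
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    print(mask_to_bbox(toy_mask))
```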
Scene classification
- CNN-based scene classification [Zhou et al., NIPS 2014]
- Pipeline: video key frame -> is scene? (yes/no) -> CNN-based scene classification -> multi-frame fusion -> tags
- Scene classification: precision 65.8%, recall 74%
- With threshold @0.7: precision 83.8%, recall 56.7%
- Figure: non-scene images; scene images of kitchen, office, living room, and bedroom
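The slide does not say how multi-frame fusion is performed. A minimal sketch, assuming fusion means averaging per-frame class probabilities and then applying the 0.7 threshold to the top class, could look like this (the function name and the input format are hypothetical):

```python
from typing import Dict, List, Optional, Tuple


def fuse_scene_scores(frame_probs: List[Dict[str, float]],
                      threshold: float = 0.7) -> Optional[Tuple[str, float]]:
    """Average per-frame class probabilities over a shot.

    Returns (scene, mean_probability) for the top class if its mean
    probability reaches the threshold, otherwise None (no scene tag).
    """
    if not frame_probs:
        return None
    totals: Dict[str, float] = {}
    for probs in frame_probs:
        for scene, p in probs.items():
            totals[scene] = totals.get(scene, 0.0) + p
    best_scene, best_sum = max(totals.items(), key=lambda kv: kv[1])
    mean_p = best_sum / len(frame_probs)
    return (best_scene, mean_p) if mean_p >= threshold else None


if __name__ == "__main__":
    shot = [
        {"kitchen": 0.8, "office": 0.2},
        {"kitchen": 0.9, "office": 0.1},
        {"kitchen": 0.6, "office": 0.4},
    ]
    print(fuse_scene_scores(shot))
```

Raising the threshold rejects more uncertain shots, which is consistent with the trade-off on the slide: precision rises from 65.8% to 83.8% while recall falls from 74% to 56.7%.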
Scene classes
0 kitchen, 1 dining, 2 bakery, 3 ice_cream_parlor, 4 bathroom, 5 washing_room
6 bedroom, 7 living_room, 8 office, 9 children_room, 10 nursery, 11 toyshop
12 shoe_shop, 13 jewelry_shop, 14 outdoor_ice_world, 15 indoor_ice_skating_rink
16 baseball, 17 football, 18 basketball_court, 19 swimming_pool, 20 track
21 bowling_alley, 22 billiards, 23 tennis, 24 volleyball, 25 gymnasium
26 pleasure_ground, 27 hospital_room, 28 dentists, 29 drugstore
30 music_studio, 31 music_store, 32 sandbeach, 33 hairsalon, 34 bar
35 pagoda, 36 bamboo_forest, 37 mountain, 38 coast, 39 creek, 40 waterfall
41 grass, 42 other
Same style matching
- SIFT feature matching: normalization of SIFT; dimension 128 dim x 400 pts; MAP 22%
- CNN feature of ImageNet 1k classifier: model VGG19; layer fc7; dimension 4096 600; MAP 28%
- CNN feature of same-style classifier: model VGG19; layer fc7; dimension 4096 600; MAP 34%
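The slides report MAP without defining it. The sketch below uses the standard definition of mean average precision for ranked retrieval; whether the reported numbers were computed in exactly this way (for example, with or without a rank cutoff) is not stated in the source. Item IDs are assumed to be unique within each ranking.

```python
from typing import List, Sequence, Set


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings: List[Sequence[str]],
                           relevants: List[Set[str]]) -> float:
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    rankings = [["a", "x", "b", "y"], ["x", "c"]]
    relevants = [{"a", "b"}, {"c"}]
    # Query 1: (1/1 + 2/3) / 2; query 2: (1/2) / 1.
    print(mean_average_precision(rankings, relevants))
```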
Multi-feature fusion
- Same-class matching classifier trained on ImageNet 21k classes (15M images)
- Same-style matching classifier trained on 1239 queries of 1M images

CNN model                     Feature dim   MAP
Inception_bn1k                1024          24%
Inception_21k                 1024          34%
Vgg19_caffe                   4096          34%
Inception_21k + vgg19_caffe   5120          43%

Speed
- Nvidia K40 GPU: 10x faster than an i7 CPU
- Faster R-CNN: 200 ms/frame at image size 1920x1080
- VGG19 feature extraction: 60 ms/frame at image size 256x256
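The fused feature dimension (1024 + 4096 = 5120) suggests the two feature vectors are concatenated. The sketch below shows one common way to do that, L2-normalizing each vector first so that neither model dominates the distance. The normalization step and the function names are assumptions; the slides only give the resulting dimension.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [v / norm for v in vec]


def fuse_features(*features: Sequence[float]) -> List[float]:
    """Concatenate L2-normalized feature vectors into one descriptor."""
    fused: List[float] = []
    for f in features:
        fused.extend(l2_normalize(f))
    return fused


if __name__ == "__main__":
    # Dummy stand-ins for a 1024-dim Inception feature and a 4096-dim VGG19 fc7 feature.
    inception_feat = [0.0] * 1023 + [2.0]
    vgg_feat = [3.0, 4.0] + [0.0] * 4094
    fused = fuse_features(inception_feat, vgg_feat)
    print(len(fused))
```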
Experiments
- MAP on 3M test images, trained on 1M images
- Settings: VGG19 model, full image, object rectangle, PCA+LDA, Inception-21k
- MAP: 27.8%, 34.2%, 37.3%, 43.1%, 46.1%
Speed-up
- Parallel FLANN tree indexing
- Hierarchical filtering by object classes: 10x faster
- Query speed: 1 s/image on 5000 TV series with 2M images
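The idea behind hierarchical filtering by object class can be sketched as a two-stage lookup: first restrict the search to database items of the query's detected class, then sort the survivors by feature distance. In the toy index below, a brute-force scan within each class bucket stands in for the FLANN tree index; the class name `ClassPartitionedIndex` and its methods are hypothetical and are not the system's actual interface.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class ClassPartitionedIndex:
    """Toy index: one bucket of (item_id, feature) pairs per object class."""

    def __init__(self) -> None:
        self._buckets: Dict[str, List[Tuple[str, List[float]]]] = defaultdict(list)

    def add(self, item_id: str, label: str, feature: Sequence[float]) -> None:
        self._buckets[label].append((item_id, list(feature)))

    def query(self, label: str, feature: Sequence[float],
              top_k: int = 5) -> List[Tuple[str, float]]:
        # Stage 1: coarse filter, look only at items of the same object class.
        candidates = self._buckets.get(label, [])
        # Stage 2: Euclidean distance on the surviving candidates, then sort.
        scored = [(math.dist(feature, f), item_id) for item_id, f in candidates]
        scored.sort()
        return [(item_id, d) for d, item_id in scored[:top_k]]


if __name__ == "__main__":
    index = ClassPartitionedIndex()
    index.add("bag_1", "handbag", [0.0, 0.0])
    index.add("bag_2", "handbag", [3.0, 4.0])
    index.add("shoe_1", "shoes", [0.1, 0.0])
    print(index.query("handbag", [0.0, 1.0], top_k=2))
```

Because each query touches only one bucket, the number of distance computations shrinks roughly in proportion to that class's share of the database.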
Query system GUI
Query examples on image dataset
Query examples on video dataset
Conclusion
- An accurate bounding box is important for recognizing objects
- Fusing same-style matching features with same-class matching features gives higher accuracy
- PCA and LDA further improve accuracy and speed
- GPU feature extraction is about 10x faster than CPU for CNN features
- Queries are sped up by parallel indexing and hierarchical filtering
References
- Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
- Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
- Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
- Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015. arXiv:1411.4038.
- Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV. 2015.
- Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR. 2015.
- Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS. 2014.
- Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR. 2015.
- Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV. 2015.
- Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.