Supplementary Material for the paper "Co-segmentation Inspired Attention Networks for Video-based Person Re-identification"


Arulkumar Subramaniam, Athira Nambiar, Anurag Mittal
Department of Computer Science and Engineering, Indian Institute of Technology Madras
{aruls, anambiar, amittal}@cse.iitm.ac.in

Pseudo-code of the COSAM module

The COSAM layer makes use of two subroutines, COSAM-SPATIAL-ATTENTION() and COSAM-CHANNEL-ATTENTION(), to achieve the notion of co-segmentation, as shown in Algorithm 1.

Algorithm 1: COSAM layer
Input:  Frame-level feature maps after the L-th CNN block, F^L = {F_i^L}, i = 1..N,
        where sizeof(F^L) = N x D^L x H^L x W^L; i = index of the frame in the video,
        N = total number of frames, D^L = number of channels, H^L = height, W^L = width.
Output: Co-segmented feature maps F_cosam^L = COSAM({F_i^L}, i = 1..N),
        with the same dimensions as F^L.

 1: procedure COSAM-SPATIAL-ATTENTION(F)              // F = input feature maps to the subroutine
 2:     F_R = conv_1x1(F)                             // dimension reduction: D^L -> D^R
 3:     F_R = F_R.view_as(N x D^R x (H^L * W^L))      // flatten the height and width dimensions
 4:     C_n(i, j) = { NCC(F_R,n^(i,j), F_R,m^(h,w)) : 1 <= m <= N, m != n; 1 <= h <= H^L; 1 <= w <= W^L }
                                                      // cost volume for the n-th frame feature map;
                                                      // sizeof(C) = N x ((N-1) * H^L * W^L) x H^L x W^L
 5:     M_spatial = Sigmoid(conv_1x1(C))              // summarization step: ((N-1) * H^L * W^L) -> 1
 6:                                                   // sizeof(M_spatial) = N x 1 x H^L x W^L
 7:     F_spatial = M_spatial ⊙ F                     // ⊙ = point-wise multiplication
 8:     return F_spatial

 9: procedure COSAM-CHANNEL-ATTENTION(F)              // F = input feature maps to the subroutine
10:     F_GAP = GAP(F)                                // global average pooling (GAP); sizeof(F_GAP) = N x D^L
11:     M_ind = Sigmoid(MLP(F_GAP))                   // individual channel masks; sizeof(M_ind) = N x D^L
12:     M_channel = avg-pool(M_ind)                   // common channel activation mask; sizeof(M_channel) = 1 x D^L
13:     F_channel = M_channel ⊗ F                     // ⊗ = channel-wise multiplication
14:     return F_channel

15: procedure COSAM(F^L)
16:     F_spatial^L = COSAM-SPATIAL-ATTENTION(F^L)    // invoke the spatial attention procedure
17:     F_cosam^L = F_spatial^L + COSAM-CHANNEL-ATTENTION(F_spatial^L)   // invoke the channel attention procedure
18:     return F_cosam^L
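As a concrete illustration, the channel-attention subroutine (lines 9-14 of Algorithm 1) can be sketched in NumPy. The two-layer MLP with a ReLU hidden layer (weights W1, W2) is an assumed instantiation, since the pseudo-code only says "MLP"; this is a sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosam_channel_attention(F, W1, W2):
    """Sketch of COSAM-CHANNEL-ATTENTION (Algorithm 1, lines 9-14).

    F      : (N, D, H, W) frame-level feature maps.
    W1, W2 : weights of an assumed 2-layer MLP with a ReLU hidden layer.
    """
    # Line 10: global average pooling over the spatial dims -> (N, D)
    F_gap = F.mean(axis=(2, 3))
    # Line 11: per-frame channel masks M_ind = Sigmoid(MLP(F_GAP)) -> (N, D)
    M_ind = sigmoid(np.maximum(F_gap @ W1, 0.0) @ W2)
    # Line 12: common channel mask, averaged over the N frames -> (1, D)
    M_channel = M_ind.mean(axis=0, keepdims=True)
    # Line 13: channel-wise multiplication, broadcast over frames and space
    return F * M_channel[:, :, None, None]
```

Because the averaged mask lies in (0, 1) per channel, the step can only attenuate channels that the frames do not jointly activate, which is what produces the "common channel" behavior.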

1. Ablation study (continued)

In this section, the best performing model SE-ResNet50+COSAM_4,5+TP_avg is further subjected to ablation studies with respect to the selection of frames, cross-dataset performance, and the effect of spatial vs. channel attention in COSAM.

1.1. Selection of frames

We experiment with two frame-selection schemes during training: 1) Sequential: a continuous sequence of N frames is selected; 2) Random: N frames are randomly sampled from the whole video sequence. The quantitative results are shown in Table 1. Training with randomly sampled frames performs worse than training with sequentially sampled frames.

frame selection |          MARS          |   DukeMTMC-VideoReID
                | mAP   R1    R5    R20  | mAP   R1    R5    R20
sequential      | 79.9  84.9  95.5  97.9 | 94.1  95.4  99.3  99.8
random          | 77.5  83.3  93.6  97.0 | 90.2  90.6  98.3  99.6

Table 1. Influence of frame selection on the Re-ID performance of the best performing model SE-ResNet50+COSAM_4,5+TP_avg.

1.2. Cross-dataset test performance

We analyze the cross-dataset performance of the best performing model (SE-ResNet50+COSAM_4,5+TP_avg) against a baseline model without the COSAM layer. The results are shown in Table 2. Training on MARS and testing on DukeMTMC-VideoReID, the COSAM-based model outperforms the baseline by 2.8% mAP and 3.5% CMC Rank-1. A similar improvement is observed when training on DukeMTMC-VideoReID and testing on MARS (0.9% mAP and 0.7% CMC Rank-1).

Model      | Train set | Test set  | mAP   R1    R5    R20
No COSAM   | MARS      | DukeMTMC  | 32.0  33.3  53.3  67.1
COSAM_4,5  | MARS      | DukeMTMC  | 34.8  36.8  54.1  67.9
No COSAM   | DukeMTMC  | MARS      | 25.0  41.7  54.4  65.3
COSAM_4,5  | DukeMTMC  | MARS      | 25.9  42.4  56.0  65.8

Table 2. Cross-dataset performance of the best performing model with SE-ResNet50 as the feature extractor and TP_avg as the temporal aggregation layer. Here, DukeMTMC = DukeMTMC-VideoReID.
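The two training-time sampling schemes compared in Table 1 can be sketched as follows; the function name and signature are illustrative, not taken from the paper.

```python
import random

def select_frames(num_total, n, scheme="sequential", rng=None):
    """Pick n frame indices from a video of num_total frames.

    'sequential': a contiguous window of n frames at a random offset.
    'random'    : n indices sampled without replacement, then sorted
                  to preserve temporal order.
    Illustrative sketch of the two schemes compared in Table 1.
    """
    rng = rng or random.Random()
    if scheme == "sequential":
        start = rng.randrange(num_total - n + 1)
        return list(range(start, start + n))
    if scheme == "random":
        return sorted(rng.sample(range(num_total), n))
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, `select_frames(100, 4, "sequential")` returns four consecutive indices, while `select_frames(100, 4, "random")` may return indices spread across the whole clip.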
1.3. Spatial vs. channel attention

To evaluate the effect of the attention steps in the COSAM layer, we investigate each attention step (spatial and channel) individually. The results in Table 3 show that including both the spatial and channel attention steps together achieves superior performance.

Attention layer          |          MARS          |   DukeMTMC-VideoReID
                         | mAP   R1    R5    R20  | mAP   R1    R5    R20
Only spatial attention   | 78.8  84.1  94.9  97.7 | 93.6  93.9  99.0  99.9
Only channel attention   | 79.0  84.3  95.0  97.8 | 93.8  94.4  99.1  99.7
Both spatial and channel | 79.9  84.9  95.5  97.9 | 94.1  95.4  99.3  99.8

Table 3. Influence of the co-segmentation-based attention layers on the Re-ID performance of the best performing model SE-ResNet50+COSAM_4,5+TP_avg.

2. Location of the COSAM module (multiple COSAM)

In addition to Table 3 of the main paper, we experiment with inserting the COSAM layer simultaneously after multiple CNN blocks and report the performance in Table 4. Adding our plug-and-play COSAM layer at multiple positions in the network (especially after the deeper blocks) keeps improving the system's performance. Keeping computational overhead and memory efficiency in mind, the COSAM_4,5 variant with SE-ResNet50 as the feature extractor is selected as the de facto model for the other experiments.
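The COSAM_i placement experiment amounts to interleaving COSAM layers after selected backbone blocks. A framework-agnostic sketch (all names hypothetical, with plain callables standing in for CNN blocks and COSAM layers):

```python
def build_backbone(blocks, cosam_layers, positions):
    """Compose CNN blocks, inserting a COSAM layer after each chosen block.

    blocks       : ordered list of callables (CNN blocks 1..5).
    cosam_layers : dict mapping block index i -> COSAM callable (COSAM_i).
    positions    : set of block indices after which COSAM is plugged in,
                   e.g. {4, 5} for the COSAM_4,5 variant.
    """
    pipeline = []
    for i, block in enumerate(blocks, start=1):
        pipeline.append(block)
        if i in positions:
            pipeline.append(cosam_layers[i])

    def forward(x):
        # Run the input through blocks and interleaved COSAM layers in order.
        for layer in pipeline:
            x = layer(x)
        return x

    return forward
```

For the COSAM_4,5 variant the resulting order is block1, block2, block3, block4, COSAM_4, block5, COSAM_5, which is what makes the module plug-and-play: the backbone itself is unchanged.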
Backbone     | COSAM_i       |          MARS          |   DukeMTMC-VideoReID
             |               | mAP   R1    R5    R20  | mAP   R1    R5    R20
ResNet50     | No COSAM [1]  | 75.8  83.1  92.8  96.8 | 92.9  93.6  99.0  99.7
             | COSAM_2       | 68.3  77.7  90.1  96.1 | 88.9  90.2  98.4  99.0
             | COSAM_3       | 76.9  82.7  94.3  97.3 | 93.6  94.0  98.7  99.9
             | COSAM_4       | 76.8  82.9  94.2  97.1 | 93.8  94.7  98.7  99.7
             | COSAM_5       | 76.6  82.8  93.9  97.2 | 93.2  93.7  98.4  99.9
             | COSAM_3,4     | 76.4  83.4  93.9  97.1 | 93.7  94.4  99.1  99.4
             | COSAM_3,5     | 76.9  83.7  94.0  97.3 | 93.0  93.7  99.0  99.7
             | COSAM_4,5     | 77.2  83.7  94.1  97.5 | 94.0  94.4  99.1  99.9
             | COSAM_3,4,5   | 76.6  83.2  93.7  97.3 | 93.1  93.6  98.7  99.4
SE-ResNet50  | No COSAM      | 78.3  84.0  95.2  97.1 | 93.5  93.7  99.0  99.7
             | COSAM_2       | 67.0  77.9  90.4  94.9 | 92.2  94.0  98.9  99.7
             | COSAM_3       | 79.5  85.0  94.7  97.8 | 93.6  94.7  99.0  99.9
             | COSAM_4       | 79.8  84.9  95.4  97.8 | 94.0  95.4  99.0  99.9
             | COSAM_5       | 79.9  84.5  95.7  97.9 | 93.9  94.9  99.1  99.9
             | COSAM_3,4     | 79.5  84.8  94.7  97.6 | 93.7  94.7  98.7  99.7
             | COSAM_3,5     | 79.8  85.2  95.5  98.0 | 93.9  94.2  99.3  99.9
             | COSAM_4,5     | 79.9  84.9  95.5  97.9 | 94.1  95.4  99.3  99.8
             | COSAM_3,4,5   | 80.5  85.2  95.5  98.0 | 94.1  95.4  99.3  99.9

Table 4. Evaluation of the backbone feature extractors with COSAM and TP_avg as the temporal aggregation layer. Here, COSAM_i = plugging the COSAM layer in after the i-th CNN block.

3. State-of-the-art comparison on the iLIDS-VID dataset

We observe that SE-ResNet50 (COSAM_4,5) with the TP_avg aggregation layer outperforms the nearest competing method [2] by 2.5% CMC Rank-1 (Table 5).

Method                                   | R1     R5     R20
Top-push video Re-ID [4]                 | 56.3   87.6   98.3
JST-RNN [5]                              | 55.2   86.5   97.0
Joint ST pooling [3]                     | 62.0   86.0   98.0
Region QEN [2]                           | 77.1   93.2   99.4
RevisitTempPool [1]                      | 73.9   92.6   98.41
[1] + SE-ResNet50 + TP_avg               | 76.87  93.94  99.07
SE-ResNet50 + COSAM_4,5 + TP_avg (ours)  | 79.61  95.32  99.8

Table 5. Comparison with the state of the art on the iLIDS-VID dataset.

4. Visualization of attention masks

We visualize the spatial masks obtained on the test sets of DukeMTMC-VideoReID and MARS in Figures 1, 2, 3 and 4.

Figure 1. Visualization results. In each column, the first row shows the image frames of a person from the DukeMTMC-VideoReID dataset. The second, third and fourth rows show the corresponding co-segmentation maps at the COSAM3, COSAM4 and COSAM5 layers, respectively, in the SE-ResNet50+COSAM3,4,5 model trained on DukeMTMC-VideoReID.

Figure 2. Visualization results. In each column, the first row shows the image frames of a person from the MARS dataset. The second, third and fourth rows show the corresponding co-segmentation maps at the COSAM3, COSAM4 and COSAM5 layers, respectively, in the SE-ResNet50+COSAM3,4,5 model trained on MARS.

Figure 3. Visualization results. The second row shows the co-segmentation maps (COSAM4 layer in the SE-ResNet50+COSAM4,5 model trained on MARS) corresponding to the images in the first row.

Figure 4. Visualization of some failure cases, in which the COSAM spatial attention step disregards certain part information. The second row shows the co-segmentation maps (COSAM4 layer in the SE-ResNet50+COSAM4,5 model trained on MARS) corresponding to the images in the first row.

References

[1] J. Gao and R. Nevatia. Revisiting temporal modeling for video-based person re-id. CoRR, abs/1805.02104, 2018.
[2] G. Song, B. Leng, Y. Liu, C. Hetang, and S. Cai. Region-based quality estimation network for large-scale person re-identification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[3] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 4733-4742, 2017.
[4] J. You, A. Wu, X. Li, and W.-S. Zheng. Top-push video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1345-1353, 2016.
[5] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4747-4756, 2017.