
An In-Depth Evaluation of Multimodal Video Genre Categorization

Ionuț MIRONICĂ (1) imironica@imag.pub.ro
Bogdan IONESCU (1,2) bionescu@imag.pub.ro
Peter KNEES (3) peter.knees@jku.at
Patrick LAMBERT (2)

11th International Workshop on Content-Based Multimedia Indexing, Veszprém, Hungary, June 17-19
(1) University POLITEHNICA of Bucharest

Presentation outline
- Problem statement, state-of-the-art and contribution
- Video content description
- Data fusion
- Experimental results
- Conclusions

Problem statement
- data indexing is based on extracting content-based descriptors (numeric/compact representation of rich text-audio-visual information); content annotation is the key process;
- goal: determine relevant content descriptors that facilitate automatic applications, here classification into predefined categories, i.e., video genres;
- examples: automatic categorization of videos (e.g., TV programs, video selling platforms); genre-based visualization of web media (e.g., YouTube, blip.tv).

State-of-the-art
Many approaches (more than 10 years), e.g.:

[X. Yuan, W. Lai, T. Mei, X.S. Hua, X.Q. Wu, S. Li 06] spatio-temporal
- annotation: temporal (avg. shot length, cut %, camera motion) & spatial (face frames %, avg. brightness, color entropy);
- classifier: Decision Trees & several SVMs;
- genres: movie, commercial, news, music & sports; movies: action, comedy, horror & cartoon; sports: baseball, football, volleyball, tennis, basketball & soccer.

[Y. Song, Y.-D. Zhang, X. Zhang, J. Cao, J.-T. Li 09] only text
- annotation: contextual and social information: metadata, user behavior, viewer's behavior and video relevance;
- classifier: incremental SVM;
- genres: large-scale web categorization.

State-of-the-art 2

[M. Montagnuolo, A. Messina 09] multi-modal
- annotation: visual-perceptual (colour, texture, motion), temporal (shot length, distribution, rhythm, etc.), cognitive (face properties) & aural (text, sound characteristics);
- classifier: parallel Neural Network system;
- genres: football, cartoons, music, weather forecast, newscast, talk shows & commercials.

[S. Schmiedeke, C. Kofler, I. Ferrane 12] truly multi-modal
Genre Tagging, MediaEval Benchmarking Initiative for Multimedia Evaluation (2011, 2012): evaluation framework.
- annotation: aural, visual, text from ASR, text from web metadata;
- genres: 26 blip.tv genres, Internet videos.

Contribution to the state-of-the-art
In this context, we attempt to respond to several research questions:
- to what extent can aural and visual information (which can be extracted automatically) match or even surpass the performance of semantic textual descriptors?
- how effective would an adequate combination of various modalities be in achieving highly accurate classification?
- how important is the contribution of visual modalities in improving the accuracy obtained with textual data?

Approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: machine learning paradigm.
[Diagram: labeled data (e.g., web, food, autos) is used to train a classifier; the classifier tags unlabeled data from the video database, producing tagged video.]

Content description - audio
Standard audio features (audio frame-based): Zero-Crossing Rate, Linear Predictive Coefficients, Line Spectral Pairs, Mel-Frequency Cepstral Coefficients, spectral centroid, flux, rolloff, and kurtosis, plus the variance of each feature over a certain window.
[Diagram: frame-level features f1, f2, ..., fn over time; global feature = mean & variance.]
[B. Mathieu et al., Yaafe toolbox, ISMIR 10, Netherlands]

MPEG-7 & color/texture descriptors (visual frame-based)
Local Binary Pattern, Autocorrelogram, Color Coherence Vector, Color Layout Pattern, Edge Histogram, classic color histogram, Scalable Color Descriptor, color moments.
[Diagram: frame-level features f1, f2, ..., fn over time; global feature = mean & dispersion & skewness & kurtosis & median & root mean square.]

Bag-of-Words descriptors
- dictionary of 4,096 words;
- rgbSIFT and spatial pyramids (2x2);
- pipeline: detection of interest points, codeword dictionary, generation of BoW histograms, classifier training.
[OpenCV toolbox] [CIVR 2009, J. Uijlings et al.]

Histogram of Oriented Gradients (HoG)
Divides the image into 3x3 cells and, for each of them, builds a pixel-wise histogram of edge orientations.
[CITS 2009, O. Ludwig et al.]

Structural descriptors
Describe structural information in terms of contours and their relations (scale-space representation, e.g., σ=1, σ=3):
- b: degree of curvature (proportional to the maximum amplitude of the bowness space); straight vs. bow;
- ζ: degree of circularity; ½ circle vs. full circle;
- e: edginess parameter; zig-zag vs. sinusoid;
- y: symmetry parameter; irregular vs. even.
[IJCV, C. Rasche 10]
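
The step from frame-level features f1, ..., fn to one global feature per video can be sketched as below. This is a minimal illustration and not the authors' code: the function names are hypothetical, "dispersion" is assumed to mean the standard deviation, and skewness and kurtosis are assumed to be the standardized third and fourth moments (set to 0.0 for a constant feature).

```python
from math import sqrt
from statistics import fmean, median, pvariance


def column_stats(values, extended=False):
    """Summarize one feature dimension across all frames of a video."""
    mu = fmean(values)
    var = pvariance(values, mu)
    if not extended:
        # Audio-style summary: mean and variance.
        return [mu, var]
    sd = sqrt(var)
    # Assumption: standardized moments, defined as 0.0 when the feature is constant.
    skew = fmean(((v - mu) / sd) ** 3 for v in values) if sd > 0 else 0.0
    kurt = fmean(((v - mu) / sd) ** 4 for v in values) if sd > 0 else 0.0
    rms = sqrt(fmean(v * v for v in values))
    # Visual-style summary: mean, dispersion, skewness, kurtosis, median, RMS.
    return [mu, sd, skew, kurt, median(values), rms]


def global_descriptor(frames, extended=False):
    """frames: list of equal-length per-frame feature vectors."""
    descriptor = []
    for column in zip(*frames):  # transpose: one tuple per feature dimension
        descriptor.extend(column_stats(column, extended))
    return descriptor


if __name__ == "__main__":
    toy_frames = [[1.0, 2.0], [3.0, 2.0]]  # two frames, two features each
    print(global_descriptor(toy_frames))
    print(global_descriptor(toy_frames, extended=True))
```

The global descriptor has a fixed length regardless of the number of frames, which is what lets videos of different durations share one classifier.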

Content description - textual
TF-IDF descriptors (Term Frequency-Inverse Document Frequency); text sources: ASR and metadata from the Internet.
1. remove XML markups,
2. remove terms below the 5% percentile of the frequency distribution,
3. select term corpus: retain for each genre class the m terms (e.g., m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently than in complement classes,
4. video descriptor: the TF-IDF values.

Data fusion: multimodal integration
Early fusion:
[Diagram: feature extraction, normalization, concatenation into a global descriptor, classification step, global confidence score, decision.]

Data fusion: multimodal integration 2
Late fusion:
[Diagram: feature extraction; classification step with classifiers 1, 2, ..., n producing confidence values cv1, cv2, ..., cvn; confidence score normalization; aggregation into a global confidence score; decision.]
- design the aggregation function;
- we test:

Experimental results
Data set: MediaEval 2012 Genre Tagging Task
- 14,838 episodes from 2,249 shows (3,260 hours);
- 26 video genres (art, autos, business, comedy, gaming...).

Experimental results 2
Classification scheme:
- we have selected five of the most popular approaches: Support Vector Machines (SVM) with linear, Radial Basis Function (RBF) and Chi-square kernels; k-Nearest Neighbour; Random Forests and Extremely Random Forests (ERF);
- we perform training on 5,288 videos and testing on 9,550;
- classifier parameters and late fusion weights were optimized on the training dataset.
Evaluation metrics: Mean Average Precision (MAP) summarizes rankings from multiple queries by averaging average precision.

Experimental results 3
[Chart; legend: RBF, Chi, 5-NN, Random Forests.]
Performance of visual descriptors:
- best performance with MPEG-7 (ERF) and HoG (SVM-RBF);
- Bag-of-Words does not perform very well.
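
The evaluation metric can be made concrete with a short sketch. It assumes that each genre acts as one query, that all test videos are ranked by decreasing confidence score for that genre, and that average precision is normalized by the number of relevant videos in the ranked list. The function names are illustrative, not taken from the MediaEval evaluation tools.

```python
from statistics import fmean


def average_precision(scores, labels):
    """AP for one query: scores are confidences, labels mark relevant items."""
    # Rank items by decreasing confidence score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    # Assumption: a query without relevant items contributes 0.0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """queries: list of (scores, labels) pairs, one pair per query (genre)."""
    return fmean(average_precision(s, l) for s, l in queries)


if __name__ == "__main__":
    toy_queries = [
        ([0.9, 0.8, 0.7], [True, False, True]),  # relevant at ranks 1 and 3
        ([0.6, 0.4], [False, True]),             # relevant at rank 2
    ]
    print(mean_average_precision(toy_queries))
```

In the toy example the first query has AP = (1/1 + 2/3) / 2 and the second has AP = 1/2, so MAP is their mean.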

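The late fusion scheme (per-classifier confidence values, normalization, aggregation into a global confidence score) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: min-max normalization is assumed, the SUM, Mean and MNZ rules follow their common textbook definitions (MNZ multiplies the summed score by the number of classifiers with a nonzero confidence), and the optional per-classifier weights are hypothetical parameters.

```python
def min_max_normalize(scores):
    """Map one classifier's confidence values to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(score_lists, method="mnz", weights=None):
    """score_lists[i][j]: confidence of classifier i for video j."""
    n_classifiers = len(score_lists)
    weights = weights or [1.0] * n_classifiers
    normalized = [min_max_normalize(scores) for scores in score_lists]
    fused = []
    for per_video in zip(*normalized):  # confidences of all classifiers for one video
        total = sum(w * cv for w, cv in zip(weights, per_video))
        if method == "sum":
            fused.append(total)
        elif method == "mean":
            fused.append(total / n_classifiers)
        elif method == "mnz":
            # Reward videos that several classifiers score above zero.
            nonzero = sum(1 for cv in per_video if cv > 0)
            fused.append(total * nonzero)
        else:
            raise ValueError(f"unknown fusion method: {method}")
    return fused


if __name__ == "__main__":
    audio_scores = [1.0, 3.0, 2.0]      # toy confidences from an audio classifier
    visual_scores = [10.0, 30.0, 20.0]  # toy confidences from a visual classifier
    for rule in ("sum", "mean", "mnz"):
        print(rule, late_fusion([audio_scores, visual_scores], method=rule))
```

Normalizing before aggregation matters because classifiers trained on different modalities produce confidence values on different scales.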
Experimental results 4
[Chart; legend: RBF, Chi, 5-NN, Random Forests.]
Performance of audio descriptors:
- best performance with Extremely Random Forests;
- audio descriptors provide higher discriminative power than visual features.

Experimental results 5
[Chart; legend: RBF, Chi, 5-NN, Random Forests.]
Performance of text descriptors:
- best performance with metadata and Random Forests;
- ASR provides lower performance than audio descriptors;
- metadata TF-IDF outperforms all the other approaches.

Experimental results 6
(2) Performance of multimodal integration (MAP)

                     SUM      Mean     MNZ      Rank     Early Fusion
all visual           35.82%   36.76%   38.21%   30.90%   30.11%
all audio            43.86%   44.19%   44.50%   41.81%   42.33%
all text             62.62%   62.81%   62.69%   50.60%   55.68%
audio-visual-text    64.24%   65.61%   65.82%   53.84%   60.12%

Performance of fusion techniques:
- late fusion provides higher performance than early fusion;
- using all modalities gives the best results;
- MNZ tends to provide the most accurate results, with MAP up to 65.82%.

Experimental results 7
(3) Comparison to state-of-the-art (from MediaEval 2012)

Team | Modality | Method | MAP
proposed | all | Late Fusion MNZ with all descriptors | 65.82%
proposed | text | Late Fusion Mean with TF-IDF of ASR and metadata | 62.81%
TUB | text | Naive Bayes with Bag of Words on text (metadata) | 52.25%
proposed | all | Late Fusion MNZ with all descriptors except for metadata | 51.9%
proposed | audio | Late Fusion Mean with standard audio descriptors | 44.50%
proposed | visual | Late Fusion Mean with MPEG-7 related, structural, HoG and B-o-VW with rgbSIFT | 38.21%
ARF | text | SVM linear on early fusion of TF-IDF of ASR and metadata | 37.93%
TUD | visual & text | Late Fusion of SVM with B-o-W (visual word, ASR & metadata) | 35.81%
KIT | visual | SVM with visual descriptors (color, texture, B-o-VW with rgbSIFT) | 35.81%
TUD-MM | text | Dynamic Bayesian networks on text (ASR & metadata) | 25.00%
UNICAMP-UFMG | visual | Late fusion (KNN, Naive Bayes, SVM, Random Forests) with BOW (text ASR) | 21.12%
ARF | audio | SVM linear with block-based audio features | 18.92%

Experimental results 8
(3) Comparison to state-of-the-art (from MediaEval 2012), same table with the following observations:
- metadata provides the highest discriminative power but cannot be generated automatically from video contents (MAP 52.25%);
- automatic content descriptors combined with late fusion reach similar performance, MAP 51.9% (surpassing even some metadata-based approaches);
- adding audio-visual information improves the performance of text and gives the best performing approach (MAP 65.82%).

Conclusions
(1) we provided an in-depth evaluation of truly multimodal video description approaches;
(2) we demonstrated the potential of appropriate late fusion for genre categorization;
(3) we showed that, notwithstanding the superiority of user-text based descriptors, late fusion can boost the performance of automated content descriptors to a close level;
(4) we set up a new baseline for the 2012 Genre Tagging Task by outperforming the other participants.

Acknowledgements:
- we thank Prof. Nicu Sebe and Dr. Jasper Uijlings from University of Trento for their support.
- we also acknowledge the 2012 Genre Tagging Task of the MediaEval Multimedia Benchmark for the dataset (

Thank you! Questions?