Event Detection in Basketball Video Using Multiple Modalities

Min Xu, Ling-Yu Duan, Changsheng Xu, *Mohan Kankanhalli, Qi Tian
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore, 119613
{xumin, lingyu, xucs, tian}@i2r.a-star.edu.sg
*School of Computing, National University of Singapore, Singapore, 117553
mohan@comp.nus.edu.sg

Abstract

Semantic sports video analysis has attracted increasing attention recently. In this paper, we present a basketball event detection method that uses multiple modalities. Instead of relying on low-level features, the proposed method is built upon visual and auditory mid-level features, i.e. semantic shot classes and audio keywords. By heuristically mapping semantic shot classes to events and by aligning audio keywords with semantic shot classes, we are able to detect nine basketball events. Experimental results show that the proposed method is effective for basketball event detection.

1. Introduction

With the remarkable increase of multimedia data, accessing extensive amounts of multimedia data is becoming a necessity in an era of information explosion. As the amount and complexity of video documents grow, there is an evident need for more intelligent video browsing, searching and manipulation techniques. Automatic content-based video indexing, the process of attaching content-based labels to video documents, has attracted much attention [1] [2]. Although syntactic structure analysis provides a physical and logical description of the audiovisual content, it lacks semantic meaning, e.g. structure based on temporal events and spatial objects. To facilitate high-level abstraction and efficient content-based access, semantics extraction is becoming an important aspect of the multimedia understanding problem. However, semantic concepts depend tightly on a specific application context.
Sports video has inherent structural constraints specified by the rules of the game and field production. Due to these constraints, sports video is widely used to explore the interaction of low-level features, domain knowledge and high-level semantics. Currently, there is some work reported on event identification for soccer video [3], baseball video [4] and tennis video [5]. In [3], Xu et al. presented a framework for soccer video structure analysis and event detection based on the grass-area ratio. Rui et al. [4] developed effective techniques to detect excited announcer speech and baseball hits from noisy audio signals, and fused them to extract exciting segments of baseball programs. In [5], Sudhir et al. presented techniques for automatic analysis of tennis video to facilitate content-based retrieval, based on the generation of an image model for the tennis court lines. In soccer video, audio keywords have proved to be effective feature representations for extracting high-level semantics [6]. In tennis video, our proposed fusion scheme has proved to be an efficient way to detect events [7]. In this paper, we extend our audio keywords and fusion scheme to basketball event detection. There is some existing research on basketball video [8] [9]. Tan et al. [8] developed a basketball annotation system that combines camera motion recovered from an MPEG video stream with prior knowledge of basketball structure to provide high-level content analysis, annotation and browsing for events such as wide-angle and close-up views, fast breaks, probable shots at the basket, etc. However, they did not consider auditory features. In [9], Nepal et al. identified goal segments in basketball video using five temporal goal models constrained by observations of crowd cheer, scoreboard display and change in direction. Nepal's work showed that model-based event detection is feasible.
Besides goals, there are other interesting events in basketball video. In this paper, we propose a basketball event detection method that uses multiple modalities to detect four structural events (section beginning, section ending, in play and out of play) and five regular events (jump ball at the beginning, foul, penalty, shot and goal) in basketball games. In Section 2, we briefly introduce our basketball event detection scheme. Section 3 presents structural event detection by mapping from semantic shot classes. Section 4 describes regular event detection by heuristically aligning audio keywords with semantic shot classes. Event detection results are listed in Section 5. In Section 6, we draw conclusions and discuss future work.

2. System Flow

Video annotation is often facilitated by prior knowledge of some general structure of the studied video. For tennis video analysis, we considered sound transition patterns as important prior knowledge. By using this prior knowledge,
we successfully identified five typical sound transition patterns in court-view shots to detect five interesting events in tennis video [7]. Audio was also shown to provide a potential linkage to interesting events in soccer video [6], although the result of event detection by checking audio sounds alone was not very satisfactory. In this paper, we combine the methods we used for event detection in tennis [7] and soccer [6] to construct our basketball event detection scheme. According to basketball prior knowledge, we classify basketball events into two categories: structural events and regular events. Structural events refer to events related to game structure, such as the beginning and ending of sections, in play and out of play. Regular events are frequent occurrences that strongly depend on the rules of the game, such as jump ball, foul, penalty, shot, goal and so on (see Section 3). One basketball game includes four sections, and one section lasts 20 minutes. Figure 1 shows the causal relations among the investigated events in a basketball section.

[Figure 1: Investigated Basketball Event Structure in a Basketball Section]
[Figure 2: Basketball Event Detection System Flow]

Figure 2 illustrates the logical flow of this scheme. For the visual part, we first perform semantic shot classification and then heuristically map semantic shot classes to structural events by using domain-specific knowledge. For the audio part, we perform audio keyword creation followed by heuristic alignment with semantic shot classes to generate rules for detecting regular events. As shown in Figure 2, there are two main stages: mid-level feature representation and event inference. At the first stage, semantic shot classification and audio keyword creation produce mid-level visual and auditory features, i.e. semantic shot classes and audio keywords. At the second stage, we heuristically map semantic shot classes to structural events and align audio keywords with semantic shot classes to infer regular events. The definitions of structural and regular events and more details of our system are provided in Sections 3 and 4.

3. Structural Events Detection by Semantic Shot Classification

Currently, we predefine four structural events for basketball games as follows [10]:

Section beginning
Section ending
In play: when the ball is within the boundaries of the field and play has not been stopped by the referee.
Out of play: when the ball is outside the boundaries of the field or play has been stopped by the referee.

[Figure 3: Semantic Shot Classes in a typical Basketball Video]

Compared with the in-play and out-of-play detection method for soccer [3], our method discards the court-color approach, because using court color to detect in play and out of play is heavily limited by the camera angle, and its accuracy is sometimes very low. Instead of the court-color method, we
try to find linkages between semantic shot classes and structural events. Video shot sequences convey the progress of the game; they closely follow the action on the field and emphasize exciting moments. Based on sports video production rules, we summarize the following potential linkages between shot classes and structural events:

- After a shot, goal or some other highlight, the camera usually cuts to players or the audience, e.g. audience, medium-view and close-up shots.
- Replay shots are used to replay exciting moments.
- During in-play time, the camera follows the action, e.g. full-court-advance and penalty shots.

To aid understanding, we list these linkages along with the percentage of each shot class in Figure 3. We developed semantic rules to identify the beginning and the ending of basketball game sections:

In a Bird-View shot:
    If (the following shot is a Still-Court shot)
        the beginning of a basketball game section;
    Else
        the end of a basketball game section;

In [11], we proposed a uniform framework for semantic shot classification and achieved very promising results on five types of sports video.

4. Regular Events Detection by Heuristically Aligning Audio Keywords and Semantic Shot Classes

Shot class information alone is not enough to detect regular events accurately, so we resort to other information sources that contain abundant semantic hints. Audio information in sports video plays an important role in semantic event detection. Compared to the research done on sports video analysis using visual information, relatively little work has been done using audio information [4] [12], and that work relies on audio features alone, without visual information.

4.1 Basketball Regular Events Definitions

Besides structural events, there are other interesting events which happen regularly in a basketball game [10].
Jump ball at the beginning: A method of putting the ball into play at the beginning of a section. Two opponents face one another in one of the three restraining circles, the referee tosses the ball up between them, and they try to tip it to a teammate. All other players must remain outside the circle until the ball is touched.

Foul: An infraction of the rules by a player, coach, or official that is not a violation.

Penalty: An extra free throw awarded after the opposing team has exceeded a certain limit.

Shot: An attempt to score in a game.

Goal: The score awarded for such an act.

4.2 Basketball Audio Keywords Creation

As a kind of mid-level representation, audio keywords refer to a set of game-specific audio sounds with strong relationships to the actions of players, referees, commentators and the audience. We have previously created audio keywords for tennis and soccer video to detect interesting events using different low-level audio feature sets. In this paper, we define six audio keywords for basketball: whistling, ball hitting backboard or basket, excited commentator speech, plain commentator speech, excited audience sound, and plain audience sound.

4.2.1 Low-level Feature Selection

In basketball video, the audio signal mainly comes from commentator speech, audience sound, whistling, and environmental noise. Therefore, we first extract low-level features that have been used successfully in speech analysis and then test whether they provide good results for audio signal analysis in basketball video. The extracted features, Zero-Crossing Rate (ZCR), Spectral Power (SP), Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Coefficients (LPC), Short-Time Energy (STE) and Linear Prediction Cepstral Coefficients (LPCC), capture both time-domain and frequency-domain properties of the audio signals.
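Two of the listed time-domain features, STE and ZCR, can be sketched as follows. The 20 ms frame length matches the paper's segmentation of 44.1 kHz audio; the function name and NumPy implementation details are our own illustration, not the paper's code.

```python
# Sketch: short-time energy (STE) and zero-crossing rate (ZCR) over 20 ms
# frames (882 samples per frame at 44.1 kHz). Illustrative only.
import numpy as np

def frame_features(signal, sample_rate=44100, frame_ms=20):
    """Return a list of (STE, ZCR) pairs, one per non-overlapping frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 882 samples at 44.1 kHz
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        ste = np.sum(frame ** 2) / frame_len                # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        feats.append((ste, zcr))
    return feats

# A pure tone crosses zero at twice its frequency, so ZCR is a cheap pitch cue.
t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 440 * t)
feats = frame_features(tone)
print(len(feats))  # 50 frames in one second of audio
```

Frequency-domain features such as MFCC and LPCC would typically be taken from a signal-processing library rather than written by hand.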
To select suitable features for classifying basketball audio keywords, we use a single-layer support vector machine (SVM) classifier [13] to evaluate the performance of each individual feature. In our experiments, we use two hours of basketball video data, divided equally into training and testing sets. The results are summarized in Table 1. From Table 1, we find that whistling and ball hitting backboard or basket are easily distinguished by LPCC. Since the noise level of both commentator speech and audience sound is high, no individual feature achieves good classification results for these two classes. By combining LPCC and MFCC, which obtain the better individual results for classifying audience sound and commentator speech, the error rate is reduced to 13.34% (Table 2). In most cases, identifying segments with excited commentator speech or excited audience sound is more significant than separating audience sound from commentator speech, because excited sound coincides with exciting events such as shots and goals. Hence, through feature comparison experiments, we use LPCC, MFCC and ZCR, which work better than the other low-level features, to classify excited versus plain commentator speech, and ZCR, MFCC and SP to classify excited versus plain audience sound. The performance of audio keyword creation is shown in Table 3.

Table 1: Performance Comparison of Low-level Features in the Classification

Error Rate (%)   LPCC     LPC     MFCC    SP      ZCR     STE
Whistling        0.545    3.542   3.134   7.357   16.76   91.55
Hitting          0.8174   2.452   3.815   9.264   7.902   7.766
Acclaim          13.35    21.39   20.98   35.42   25.34   63.76
Commentator      17.44    22.89   20.3    22.75   20.03   79.97

Table 2: Performance Comparison of Combining Features in the Classification

Acclaim & Commentator Speech   LPCC & ZCR   LPCC & MFCC
Error Rate (%)                 20.33        13.34

4.2.2 Hierarchical SVM Classifier

Based on the above analysis, we propose a three-layer hierarchical SVM classifier as shown in Figure 5. The kernel function is K(x, y) = exp(-||x - y||^2 / c), with c = 0.5. Note that sound has a continuous existence, and humans normally judge sound characteristics over a certain time segment. Hence, we exploit a sliding-window technique to vote on the sound type from a sequence of frame-based classification results. Our framework for audio keyword creation is shown in Figure 3.

Table 3: Performance of Basketball Audio Keywords Creation

                 WH     BHB    EC      PC      EA      PA
Error Rate (%)   0.55   0.82   21.56   20.91   19.86   16.29

WH: Whistling; BHB: Ball Hitting Backboard or Basket; EC: Excited Commentator Speech; PC: Plain Commentator Speech; EA: Excited Audience; PA: Plain Audience

We achieve a very low error rate for the whistling keyword; not surprisingly, whistling carries strong semantic hints. Compared with commentator speech and audience sound, the classification of distinctly game-specific sounds (e.g. hitting and whistling) has a lower error rate. In fact, "excited" and "plain" are subjective concepts, and their higher error rates affect only highlight detection. On the other hand, these game-specific sounds greatly facilitate event detection by playing an excellent supporting role.
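The sliding-window voting step described above can be sketched as a majority vote over frame-level SVM labels. The window length (25 frames, i.e. about 0.5 s at 20 ms/frame) and the label strings are illustrative assumptions, not values taken from the paper.

```python
# Sketch of sliding-window voting: each 20 ms frame carries an SVM label, and
# the sound type of a segment is decided by majority vote over a window of
# frames, so isolated frame misclassifications are voted out.
from collections import Counter

def vote_labels(frame_labels, window=25, hop=25):
    """Majority-vote frame labels into window-level keyword decisions."""
    decisions = []
    for start in range(0, len(frame_labels) - window + 1, hop):
        window_labels = frame_labels[start:start + window]
        decisions.append(Counter(window_labels).most_common(1)[0][0])
    return decisions

# 20 whistling frames plus 5 stray frames still vote to "whistling".
frames = ["whistling"] * 20 + ["plain-audience"] * 5 + ["plain-audience"] * 25
print(vote_labels(frames))  # ['whistling', 'plain-audience']
```

With hop equal to window the windows do not overlap; a smaller hop would give overlapping, smoother decisions at higher cost.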
[Figure 3: The framework of audio keywords creation. SVM1 (LPCC) separates whistling and ball-hitting sounds from commentator speech and audience sound; SVM2 (LPCC & ZCR) separates commentator speech from audience sound; SVM3 (MFCC, LPCC, ZCR) classifies commentator speech as excited or plain; SVM4 (MFCC, SP, ZCR) classifies audience sound as excited or plain.]

4.2.3 Audio Keywords Creation Results

The two hours of audio samples were collected at a 44.1 kHz sampling rate, in stereo, with 16 bits per sample. The audio signals were segmented at 20 ms/frame, which is the basic unit for feature extraction. Based on observation, sounds of several classes, such as commentator speech and audience sound, are mixed together most of the time in basketball video. We label every sample with the class of the dominant sound. We used two thirds of these samples for training and one third for testing. The results of keyword creation are shown in Table 3.

4.3 Regular Events Detection

In this section, we detect five interesting regular basketball events: jump ball at the beginning, foul, penalty, shot and goal. Because these events interest the audience and are closely related to the situation and score of the game, they can provide labels for semantic indexing of basketball videos. To detect these five events, we summarize the following heuristic decision rules:

Switch (shot class)
    Case Still-Court shot:
        If (audience sound and commentator speech)
            Jump ball at the beginning;
        Break;
    Case Close-up shot:
        If (the following shot is a penalty shot)
            Before penalty;
        Else
            After shot or highlight;
        Break;
    Case Full-Court shot:
        If (whistling)
            Foul;
        Else if (ball hitting backboard or basket, or excited audience sound and commentator speech)
            Shot;
        If (excited audience sound and commentator speech)
            Goal;
        Break;
    Case Penalty shot:
        Penalty;
        If (whistling)
            Foul;
        Else if (ball hitting backboard or basket, or excited audience sound and commentator speech)
            Shot;
        If (excited audience sound and commentator speech)
            Goal;
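The heuristic decision rules can be sketched as a small dispatch function over one shot's class and its detected audio keywords. The label strings and the set-based keyword representation are our own illustrative choices, not the paper's identifiers; the rule logic follows the decision rules above.

```python
# Sketch of the heuristic alignment rules: events are inferred from the current
# shot class plus the audio keywords detected within it. For close-up shots the
# rules also look at the following shot. Labels are illustrative assumptions.

def infer_events(shot_class, keywords, next_shot=None):
    """Map one shot's class and audio-keyword set to candidate events."""
    events = []
    if shot_class == "still-court":
        # Audience sound together with commentator speech marks the jump ball.
        if (any(k.endswith("audience") for k in keywords)
                and any(k.endswith("commentator") for k in keywords)):
            events.append("jump ball at the beginning")
    elif shot_class == "close-up":
        events.append("before penalty" if next_shot == "penalty"
                      else "after shot or highlight")
    elif shot_class in ("full-court", "penalty"):
        if shot_class == "penalty":
            events.append("penalty")
        excited = ("excited-audience" in keywords
                   and "excited-commentator" in keywords)
        if "whistling" in keywords:
            events.append("foul")
        elif "ball-hitting" in keywords or excited:
            events.append("shot")
        if excited:            # excited crowd and commentator imply a goal
            events.append("goal")
    return events

print(infer_events("full-court",
                   {"ball-hitting", "excited-audience", "excited-commentator"}))
# ['shot', 'goal']
```

Note that under these rules a goal always co-occurs with a shot, mirroring the original decision table.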
According to these heuristic rules, we can detect interesting events with strong semantic meaning in basketball video. Event detection results are shown in Table 4.

5. Events Detection Results

We obtain encouraging detection results for nine interesting basketball events, classified into four structural events and five regular events. Since identification of the structural events relies completely on semantic shot classification, the accuracy of structural event detection depends directly on the performance of semantic shot classification; the accuracy is between 90% and 95%. The results of regular event detection are listed in Table 4.

Table 4: Performance evaluation of regular events detection in basketball videos (80 minutes)

                 JBAB   Foul   Penalty   Shot   Goal
Ground Truth     4      48     32        192    120
No. of Misses    0      3      1         11     6
No. of False     0      1      2         24     15

JBAB: Jump Ball at the Beginning

6. Concluding Remarks

In this paper, we presented a basketball event detection method. This work extends the audio keywords used in soccer [6] and the fusion scheme used in tennis [7]. Compared with other work on basketball semantics mining, our method provides a novel way to combine audio keywords with semantic basketball video shots and, with the help of domain-specific knowledge, to detect interesting events. The method is robust and achieves satisfactory results for event detection in basketball videos. Moreover, the method offers a simple way to align video and audio by synthesizing auditory and visual mid-level features derived from low-level feature analysis, which avoids the complexity of full video and audio stream alignment. Clearly, the method is open and extensible. So far, we have tested it on tennis, soccer and basketball games, and we plan to extend it to other sports domains to find a generic solution for sports video analysis.

References

[1] S. W. Smoliar and H.
Zhang, "Content-based video indexing and retrieval," IEEE Multimedia, 1(2): 62-75, 1994.
[2] A. Yoshitaka et al., "Knowledge-assisted content-based retrieval for multimedia databases," IEEE Multimedia, 1(4): 12-20, 1994.
[3] P. Xu, L. Xie, S.-F. Chang, A. Divakaran, A. Vetro, and H. Sun, "Algorithms and Systems for Segmentation and Structure Analysis in Soccer Video," in Proc. IEEE International Conference on Multimedia and Expo, Tokyo, Japan, Aug. 22-25, 2001.
[4] Y. Rui, A. Gupta and A. Acero, "Automatically Extracting Highlights for TV Baseball Programs," in Proc. ACM Multimedia, Los Angeles, CA, pp. 105-115, 2000.
[5] G. Sudhir, J. C. M. Lee, and A. K. Jain, "Automatic Classification of Tennis Video for High-level Content-based Retrieval," in Proc. IEEE International Workshop on Content-Based Access of Image and Video Databases, pp. 81-90, 1998.
[6] M. Xu, N. C. Maddage, C. Xu, M. Kankanhalli, and Q. Tian, "Creating Audio Keywords for Event Detection in Soccer Video," in Proc. IEEE International Conference on Multimedia and Expo, Baltimore, Maryland, Jul. 6-9, 2003.
[7] M. Xu, L. Duan, C. Xu and Q. Tian, "A Fusion Scheme of Visual and Auditory Modalities for Event Detection in Sports Video," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, April 2003.
[8] Y.-P. Tan, D. D. Saur, S. R. Kulkarni, and P. J. Ramadge, "Rapid Estimation of Camera Motion from Compressed Video with Application to Video Annotation," IEEE Transactions on Circuits and Systems for Video Technology, 10(1): 133-146, 2000.
[9] S. Nepal, U. Srinivasan, and G. Reynolds, "Automatic Detection of Goal Segments in Basketball Videos," in Proc. ACM Multimedia, 2001.
[10] Basketball Glossary, http://www.hickoksports.com
[11] L.-Y. Duan, M. Xu, and Q. Tian, "Semantic Shot Classification in Sports Video," in Proc. SPIE Storage and Retrieval for Media Databases, 2003.
[12] D. Zhang and D.
Ellis, "Detecting Sound Events in Basketball Video Archive," Technical Report, Columbia University, https://www.ctr.columbia.edu/~dpwe/courses/e6820-2001-01/projects/dpzhang.pdf
[13] V. Vapnik, Statistical Learning Theory, Wiley, 1998.