Automatic Closed Caption Detection and Filtering in MPEG Videos for Video Structuring




JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 22, 1145-1162 (2006)

Automatic Closed Caption Detection and Filtering in MPEG Videos for Video Structuring

DUAN-YU CHEN, MING-HO HSIAO AND SUH-YIN LEE
Department of Computer Science and Information Engineering
National Chiao Tung University
Hsinchu, 300 Taiwan
E-mail: {dychen; mhhsiao; sylee}@csie.nctu.edu.tw

Received May 13, 2004; revised August 3, 2004; accepted September 2004. Communicated by Ming-Syan Chen.

Video structuring is the process of extracting temporal structural information of video sequences and is a crucial step in video content analysis, especially for sports videos. It involves detecting temporal boundaries, identifying meaningful segments of a video and then building a compact representation of video content. Therefore, in this paper, we propose a novel mechanism to automatically parse sports videos in the compressed domain and then to construct a concise table of video content employing the superimposed closed captions and the semantic classes of video shots. First of all, shot boundaries are efficiently examined using the approach of GOP-based video segmentation. Color-based shot identification is then exploited to automatically identify meaningful shots. The efficient approach of closed caption localization is proposed to first detect caption frames in meaningful shots. Then caption frames, instead of every frame, are selected as targets for detecting closed captions based on long-term consistency without size constraint. Besides, in order to support discriminating captions of interest automatically, a novel tool, the font size detector, is proposed to recognize the font size of closed captions using compressed data in MPEG videos. Experimental results show the effectiveness and the feasibility of the proposed mechanism.

Keywords: caption frame detection, closed caption detection, font size differentiation, video structuring, video segmentation

1. INTRODUCTION

With the increasing use of digital videos in education, entertainment and other multimedia applications, there is an urgent demand for tools that allow an efficient way for users to acquire desired video data. Content-based searching, browsing and retrieval is more natural, friendly and semantically meaningful to users. The need for content-based multimedia retrieval motivates the research of feature extraction from the information embedded in text, image, audio and video. With the technique of video compression getting mature, lots of videos are being stored in compressed form and accordingly more and more researchers focus on feature extraction from compressed videos, especially in MPEG format. For instance, edge features are extracted directly from MPEG compressed videos to detect scene changes [5] and captions are processed and inserted into compressed video frames [7]. Features like chrominance, shape and texture are directly extracted from MPEG videos to detect face regions [1, 3]. Videos in compressed form are analyzed and parsed for supporting video browsing [11].

However, textual information is semantically more meaningful and attracts increasing research on closed caption detection in video frames [2, 4, 6, 12-17]. The approaches in [6, 12-14, 16, 17] detect closed captions in the pixel domain. In [16, 17], closed captions are detected in specific areas. However, it is impractical to localize closed captions in specific areas of a frame since, in different video sources, closed captions normally do not appear in a fixed position. A number of previous researches extract closed captions from still images and video frames [13-15, 27, 28] with a constraint that characters are bounded in size. Besides, these approaches usually require the property that text has a good contrast with the background. However, text region localization with a size constraint is not practical, especially for cases in which captions are small in size but are very significant and meaningful. For example, in sports videos, the superimposed scoreboards show the intermediate results between competitors and present the match as clearly as possible without interference. There has been very little effort to extract features in the compressed domain to detect closed captions in videos. Zhong et al. [2] and Zhang and Chua [4] detect large closed captions frame-by-frame in MPEG videos using DCT AC coefficients to obtain texture information in I-frames without exploiting the temporal information in consecutive frames. However, it is impractical and inefficient to detect closed captions in each frame. Due to the temporal nature of long-term consistency of closed captions over continuous video frames, it would be more robust to detect the closed caption based on its spatial-temporal consistency. Gargi et al. [15] perform text detection by counting the number of intra-coded blocks in P and B frames based on the assumption that the background is static. Hence, it is vulnerable to abrupt and significant camera motion. Besides, this approach is only applied to the P and B frames and does not handle captions that appear in the I-frames.

In this paper, in order to detect closed captions efficiently and flexibly, we propose an approach for compressed videos to detect caption frames in meaningful shots. Then caption frames, instead of every frame, are selected as targets for localizing closed captions without size constraint, while considering the long-term consistency of closed captions over continuous caption frames for removing noise. Moreover, we propose a novel tool, the font size detector, to identify font size in compressed videos. Using this tool, after the targeted font size is indicated, users can automatically discriminate captions of interest instead of captions in a presumed position. It is worth noticing that font size recognition is a critical step in the process of video OCR, since a bottleneck for recognizing characters is due to the variation of text font and size [22-25]. Therefore, this tool can be used as a prefilter to quickly signal the potential caption text and thus reduce the amount of data that needs to be processed.

The proposed system architecture is shown in Fig. 1. All the tasks are accomplished in the compressed domain. GOP-based video segmentation [8] is exploited to efficiently segment video into shots. The color-based shot identification is proposed to automatically identify meaningful shots. Caption frames in these shots are detected by computing the variation of DCT AC energy both in the horizontal and vertical directions. In addition, we detect closed captions using the weighted horizontal-vertical DCT AC coefficients.
To detect closed captions robustly, each candidate closed caption is verified further by computing its long-term consistency, which is estimated over the backward shot, the forward shot and the shot itself.

Fig. 1. Overview of the system architecture: MPEG-2 video streams undergo GOP-based video segmentation, DC-variation-based shot selection, caption frame detection, text caption localization, noise filtering based on long-term consistency, and font size differentiation of the resulting closed captions.

After closed captions are obtained, we differentiate the font size of each closed caption based on the horizontal projection profile of DCT AC energy in the vertical direction. Captions of interest can then be identified by the font size and size variance. Finally, captions of interest and the meaningful shots can be employed together to construct a high-level concise table of video content.

The rest of the paper is organized as follows. Section 2 describes the color-based shot identification. Section 3 presents the proposed approach of closed caption localization. Section 4 shows the experimental results and the prototype system of video content visualization. The conclusion and future works are given in section 5.

2. SHOT IDENTIFICATION

2.1 Video Segmentation

Video data is segmented into clips that serve as logical units called shots or scenes. In the MPEG-2 format [9], the GOP layer is a random access point and contains a GOP header and a series of encoded pictures including I, P and B-frames. The size of a GOP is about 10 to 20 frames, which is less than the minimum duration between two consecutive scene changes (about 20 frames) [10]. Instead of checking frame-by-frame, we first detect possible occurrences of scene change GOP-by-GOP (inter-GOP). The difference between each consecutive GOP pair is computed by comparing the corresponding I-frames. If the difference of DC coefficients between these two I-frames is larger than the threshold, then there might exist a scene change in between these two GOPs. Hence, the GOP that might contain the scene change frames is located. In the second step, intra-GOP scene change detection, we further use the ratio of forward and backward motion vectors to find out the actual frame of the scene change within a GOP. By this approach, the experimental results [8] are encouraging and prove that the scene change detection is efficient for video segmentation.
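As a rough illustration of this two-step procedure, the Python sketch below assumes that the DC-image of each GOP's leading I-frame and per-frame forward/backward motion-vector counts have already been extracted from the bitstream; the threshold values are placeholders rather than the ones tuned in [8], and the exact intra-GOP decision rule is only approximated here.

```python
import numpy as np

def detect_candidate_gops(gop_dc_images, threshold=15.0):
    """Inter-GOP step: flag GOP boundaries whose consecutive I-frame DC-images
    differ (mean absolute difference) by more than `threshold`.

    gop_dc_images: list of 2-D NumPy arrays, one DC-image per GOP's I-frame.
    Returns indices i such that a scene change may lie between GOP i and GOP i+1.
    """
    candidates = []
    for i in range(len(gop_dc_images) - 1):
        diff = np.mean(np.abs(gop_dc_images[i + 1].astype(float) -
                              gop_dc_images[i].astype(float)))
        if diff > threshold:
            candidates.append(i)
    return candidates

def locate_change_in_gop(forward_mv_counts, backward_mv_counts, ratio_threshold=2.0):
    """Intra-GOP step (sketch): within a candidate GOP, pick the frame whose
    forward/backward motion-vector ratio is most unbalanced, as a proxy for
    the actual scene-change frame."""
    best_frame, best_score = None, 0.0
    for k, (fwd, bwd) in enumerate(zip(forward_mv_counts, backward_mv_counts)):
        ratio = fwd / max(bwd, 1)                 # avoid division by zero
        score = max(ratio, 1.0 / max(ratio, 1e-6))  # how unbalanced the ratio is
        if score > best_score and score > ratio_threshold:
            best_frame, best_score = k, score
    return best_frame
```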

2.2 Shot Identification

While the boundary of each shot is detected, the video sequence is segmented into shots consisting of the advertisement, close-up and court-view. Closed captions can then be detected in each video shot. However, it is impractical to detect closed captions in all video shots. In sports videos, the shots of court-view are our focus, since the matches are primarily shown in the shots of court-view and the scoreboards are presented mostly in these kinds of shots. Therefore, a scene identification approach is proposed to identify the shots of court-view.

To recognize the shots of court-view, it is worth noticing that the variation of the intensity in the court-view frames is very small throughout a whole clip and the value of intensity variation between consecutive frames is very similar. In contrast, the intensity of the advertisement and close-up varies significantly in each frame and the difference of the variance of intensity between two neighboring frames is relatively large. Therefore, the intensity variation within a video shot can be exploited to identify the shots of court-view. In order to efficiently obtain the intensity variance of each frame and that of a video shot, DC-images of I-frames are extracted to compute the intensity variance. The DC frame variance $FVar_{s,i}^{DC}$ and the shot variance $SVar_s$ are defined by

$FVar_{s,i}^{DC} = \sum_{j=1}^{N} DC_{i,j}^{2} / N - \left( \sum_{j=1}^{N} DC_{i,j} / N \right)^{2}$,   (1)

and

$SVar_s = \sum_{i=1}^{M} (FVar_{s,i}^{DC})^{2} / M - \left( \sum_{i=1}^{M} FVar_{s,i}^{DC} / M \right)^{2}$,   (2)

where $DC_{i,j}$ denotes the DC coefficient of the jth block in the ith frame, N represents the total number of blocks in a frame, and M denotes the total number of frames in shot s. Based on the fact that the intensity variance of a court-view frame is very small throughout a whole clip, shots are regarded as the court-view type $Shot_{Court}$ by

$Shot_{Court} = \{ Shot_s \mid FVar_{s,i}^{DC} < \delta_{frame} \text{ and } SVar_s < \delta_{shot},\ i \in [1, M] \}$,   (3)

where $\delta_{frame}$ and $\delta_{shot}$ are the predefined thresholds.

In order to demonstrate the applicability of the proposed shot identification, the variation of the intensity variance of each I-frame in sports videos including tennis, football and baseball is exhibited in Fig. 2. Fig. 2 (a) shows a tennis video composed of four tennis court shots, three close-up shots and a commercial shot. Fig. 2 (b) introduces a football sequence consisting of close-up shots and football field shots. A baseball sequence is presented in Fig. 2 (c), including pitching shots, baseball field shots and close-up shots. From Fig. 2, we can observe that the intensity variance of the court-view type is very small and the value is very similar throughout a whole clip. Thus, the clips of court-view can be indicated and selected by the characteristic that the value of the intensity variance $FVar_{s,i}^{DC}$ is small in each individual frame and is consistent over the whole shot. Therefore, the proposed approach of shot identification can be applied to identify court-view shots of sports videos, in which the view of a match consists of the intensity-consistent background of a court or athletic field.

Fig. 2. Variation of I-frame DC value: (a) tennis; (b) football; (c) baseball.
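Eqs. (1)-(3) can be computed directly from the DC-images of the I-frames of a shot. The following minimal Python sketch assumes the DC-images are available as NumPy arrays; the two thresholds are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def frame_variance(dc_image):
    """FVar_{s,i}^{DC} of Eq. (1): variance of the DC coefficients of one I-frame."""
    dc = dc_image.astype(float).ravel()
    return np.mean(dc ** 2) - np.mean(dc) ** 2

def shot_variance(frame_variances):
    """SVar_s of Eq. (2): variance of the per-frame variances over the shot."""
    fv = np.asarray(frame_variances, dtype=float)
    return np.mean(fv ** 2) - np.mean(fv) ** 2

def is_court_view(dc_images, delta_frame=2000.0, delta_shot=500.0):
    """Eq. (3): a shot is court-view if every frame variance is small and the
    variance of those values over the whole shot is also small."""
    fvars = [frame_variance(img) for img in dc_images]
    return all(v < delta_frame for v in fvars) and shot_variance(fvars) < delta_shot
```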

3. CLOSED CAPTION LOCALIZATION

In this section, we shall elaborate how to detect caption frames and how to detect closed captions in caption frames. To avoid the time-consuming overhead of closed caption examination frame-by-frame, caption frames should be detected first. The details of caption frame detection are described in section 3.1 and the approach of closed caption localization is shown in section 3.2. Section 3.3 presents the approach of font size differentiation.

3.1 Caption Frame Detection

Caption frame detection is an essential step for closed caption localization because captions may disappear in some frames and then appear subsequently. Therefore, to avoid detecting closed captions frame-by-frame, we first identify the possible frames in which captions might be present.

However, the caption size of closed captions in the shots of court-view is usually very small. Under this circumstance, the change of the AC energy of the entire frame with the appearance or disappearance of the small caption would not result in significant variation. It means that the variance of the AC energy obtained from an entire frame cannot be used as a measurement of the possibility of the presence of a small caption. In order to robustly detect closed captions without size constraint, each I-frame is divided into an appropriate number of regions (say R). However, the size of a region should be moderate so as to reflect the actual variation of appearance or disappearance of small captions. If the size of a divided region were too small, any slight change of color or texture would incur quite prominent variation of AC energy. Accordingly, in order to detect the appearance of superimposed closed captions in the four corner areas as well as in the middle of a frame, the number of regions R here can be set to six. Based on the frame division method, the variance $RVar_{s,i}^{r}$ of AC coefficients of each region in the ith frame of shot s is computed by

$RVar_{s,i}^{r} = \sum_{j=1}^{N_r} \sum_{h/v} AC_{h/v,j}^{2} / N_r - \left( \sum_{j=1}^{N_r} \sum_{h/v} AC_{h/v,j} / N_r \right)^{2}, \quad r = 1, 2, \ldots, R$,   (4)

where $AC_{h/v,j}$ denotes the horizontal AC coefficients from $AC_{0,1}$ to $AC_{0,7}$ and the vertical AC coefficients from $AC_{1,0}$ to $AC_{7,0}$ of the jth block in region r, and $N_r$ is the total number of blocks in region r. Using the energy variance $RVar_{s,i}^{r}$ of each region, the method proposed to determine caption frames is illustrated as follows:

For each region r:
  if $Diff(RVar_{s,i+1}^{r}, RVar_{s,i}^{r}) \ge \delta$, captions may appear in frame (i + 1);
  if $Diff(RVar_{s,i+1}^{r}, RVar_{s,i}^{r}) \le -\delta$, captions may disappear in frame (i + 1),   (5)

where $Diff(RVar_{s,i+1}^{r}, RVar_{s,i}^{r}) = RVar_{s,i+1}^{r} - RVar_{s,i}^{r}$. In this method, $RVar_{s,i+1}^{r}$ of region r in frame i + 1 is compared with $RVar_{s,i}^{r}$ of region r in frame i. If the difference between $RVar_{s,i+1}^{r}$ and $RVar_{s,i}^{r}$ is larger than a threshold $\delta$ (3000), the texture of region r in frame i + 1 is more complex than that of region r in frame i, i.e., closed captions may be superimposed in frame i + 1. Similarly, if the difference between $RVar_{s,i+1}^{r}$ and $RVar_{s,i}^{r}$ is smaller than $-\delta$, the texture of region r in frame i + 1 becomes less complex than that of region r in frame i, i.e., closed captions in frame i may disappear in frame i + 1.

An example of caption frame detection is demonstrated in Fig. 3, which shows the detection of caption frames with small closed captions presented. We can see that the curve of the DCT AC variance $RVar_{s,i}^{r}$ of region 1 drops abruptly in the 18th I-frame and rises in the 39th I-frame, since the scoreboard disappears from the 18th I-frame to the 38th I-frame in the area of region 1 and then appears again in the 39th I-frame.

Fig. 3. Demonstration of caption frame detection.
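A compact sketch of the per-region test of Eqs. (4) and (5) is given below in Python. It assumes that, for every I-frame, the first seven horizontal and vertical AC coefficients of each 8 × 8 block have already been grouped by region; the six-region split is prepared by the caller and the threshold δ = 3000 follows the text.

```python
import numpy as np

def region_variance(ac_coeffs):
    """RVar_{s,i}^r of Eq. (4): variance of the horizontal/vertical AC
    coefficients (AC_{0,1}..AC_{0,7} and AC_{1,0}..AC_{7,0}) of all blocks in
    one region.  `ac_coeffs` is a flat array of those coefficients."""
    ac = np.asarray(ac_coeffs, dtype=float)
    return np.mean(ac ** 2) - np.mean(ac) ** 2

def caption_events(region_coeffs_per_frame, delta=3000.0):
    """Eq. (5): compare RVar of each region between consecutive I-frames.
    `region_coeffs_per_frame` is a list (one entry per I-frame) of per-region
    AC-coefficient arrays.  Yields (frame_index, region_index, event) tuples."""
    rvars = [[region_variance(r) for r in frame] for frame in region_coeffs_per_frame]
    for i in range(len(rvars) - 1):
        for r, (prev, cur) in enumerate(zip(rvars[i], rvars[i + 1])):
            if cur - prev >= delta:
                yield i + 1, r, "appear"      # texture got more complex: caption appears
            elif cur - prev <= -delta:
                yield i + 1, r, "disappear"   # texture got simpler: caption disappears
```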

3.2 Closed Caption Localization

While the caption frames are identified, we then locate the potential caption regions in these frames by utilizing the gradient energy obtained from the horizontal and vertical DCT AC coefficients. We can observe that closed captions generally appear in rectangular form and the AC energy in the horizontal direction would be larger than that in the vertical direction, since the distance between characters is fairly small and the distance between two rows of text is relatively large. Therefore, we assign a higher weight to horizontal coefficients than to vertical coefficients. The weighted gradient energy E of an 8 × 8 block, used as a measurement for evaluating the possibility of a text block, can be defined as follows:

$E = (w^{h} E^{h}) + (w^{v} E^{v})$,   (6)

$E^{h} = \sum_{h=h1}^{h2} |AC_{0,h}|,\ h1 = 1,\ h2 = 7$,    $E^{v} = \sum_{v=v1}^{v2} |AC_{v,0}|,\ v1 = 1,\ v2 = 7$.

If the energy E of a block is larger than a predefined threshold, this block is regarded as a potential text block. Otherwise, the block is considered a non-text block and is filtered out without further processing. Besides, in order to save computation cost, we select only 3 I-frames (first, middle and last) as representative frames in a shot for closed caption localization. The result of closed caption localization is demonstrated in Fig. 4 with $w^{h}$ set to 0.7 and $w^{v}$ to 0.3. Although the scoreboard and the trademark in Fig. 4 (b) in the upper part of the frame are all located and indicated, caption regions are fragmentary and some noisy regions remain. Therefore, we adopt a morphological operator of size 1 × 5 blocks to merge fragmentary text regions, and the result is demonstrated in Fig. 4 (c). Afterward, the merged text regions are further verified by computing the long-term consistency.
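The block-level test of Eq. (6) reduces to a few lines once the DCT coefficients of an 8 × 8 block are available. The sketch below uses the weights w_h = 0.7 and w_v = 0.3 reported for Fig. 4; the energy threshold is a placeholder.

```python
import numpy as np

def weighted_gradient_energy(block_dct, w_h=0.7, w_v=0.3):
    """Eq. (6): combine the first-row (horizontal) and first-column (vertical)
    AC coefficients of an 8x8 DCT block into one weighted energy value."""
    block = np.asarray(block_dct, dtype=float)
    e_h = np.abs(block[0, 1:8]).sum()   # AC_{0,1} .. AC_{0,7}
    e_v = np.abs(block[1:8, 0]).sum()   # AC_{1,0} .. AC_{7,0}
    return w_h * e_h + w_v * e_v

def is_text_block(block_dct, energy_threshold=300.0):
    """A block is kept as a potential text block if its weighted energy is large."""
    return weighted_gradient_energy(block_dct) > energy_threshold
```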

Fig. 4. Illustration of intermediate results of closed caption localization: (a) original frame; (b) closed caption detection; (c) result after applying the morphological operation; (d) result after long-term consistency verification.

For long-term consistency checking, we select another two I-frames as temporal references, the last I-frame of the forward shot ($P_F$) and the first I-frame of the backward shot ($F_B$), where $T_f$, $T_m$ and $T_r$ are the first, middle and last I-frames of the specific shot. One possible measurement of the long-term coherence of text blocks in potential regions is to check whether the text blocks of a potential caption region appear more than half of the time in a shot, that is, whether text blocks appear three or more times among the five representative I-frames. Here, we exploit the position, intensity and texture information of potential text blocks among these representative I-frames ($P_F$, $T_f$, $T_m$, $T_r$ and $F_B$) to measure the temporal coherence, as defined by

$C = \dfrac{\sum_{k=1}^{2} (DC_{B_k} - \overline{DC})(E_{B_k} - \overline{E})}{\sqrt{\sum_{k=1}^{2} (DC_{B_k} - \overline{DC})^{2}}\ \sqrt{\sum_{k=1}^{2} (E_{B_k} - \overline{E})^{2}}}, \quad -1 \le C \le 1$,   (7)

where $DC_{B_k}$ denotes the value of the DC coefficient of $B_k$, $\overline{DC}$ is the average of $DC_{B_k}$ and $DC_{B_{k+1}}$, $E_{B_k}$ represents the weighted gradient energy E of $B_k$ as defined in Eq. (6) and $\overline{E}$ is the average of $E_{B_k}$ and $E_{B_{k+1}}$. A block is characterized by its intensity, represented by the DC coefficient, and also by its texture, obtained from the AC coefficients. We compute the correlation C to measure the similarity between two blocks $B_k$ and $B_{k+1}$, which are in the same corresponding position in their respective frames i and i + 1. If the value C of a block pair is larger than $\delta_C$, these two blocks are regarded as the same. To estimate the temporal coherence of potential text blocks, we need to compute the pairwise correlation C four times among the 5 representative I-frames. Therefore, a text block is long-term consistent in the specific video shot only when, for more than half of these comparisons, the pairwise I-frame correlation C is larger than $\delta_C$. The result of long-term consistency checking of text blocks is demonstrated in Fig. 4 (d). We can see that the scoreboard and the trademark are all successfully localized and most of the noise is removed. The proposed closed caption localization can also be applied to other kinds of videos such as baseball, news and volleyball, as demonstrated in Fig. 5. In Fig. 5 (a), we can observe that the closed caption, primarily composed of Chinese characters, is also localized correctly.
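The pairwise block correlation of Eq. (7) and the majority vote over the five representative I-frames can be sketched as follows. The block features (DC value and weighted energy E from Eq. (6)) are assumed to be precomputed, and δ_C is a placeholder threshold.

```python
import numpy as np

def block_correlation(dc1, e1, dc2, e2):
    """Eq. (7): correlation between two co-located blocks, each described by
    its DC coefficient (intensity) and weighted gradient energy E (texture)."""
    x = np.array([dc1, dc2], dtype=float)   # DC_{B_k}, DC_{B_{k+1}}
    y = np.array([e1, e2], dtype=float)     # E_{B_k},  E_{B_{k+1}}
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den if den > 0 else 0.0

def is_long_term_consistent(features, delta_c=0.5):
    """`features` is a list of (DC, E) pairs for the same block position in the
    five representative I-frames (P_F, T_f, T_m, T_r, F_B).  The block is kept
    if more than half of the four consecutive-frame correlations exceed delta_c."""
    votes = 0
    for (dc1, e1), (dc2, e2) in zip(features, features[1:]):
        if block_correlation(dc1, e1, dc2, e2) > delta_c:
            votes += 1
    return votes >= 3   # more than half of the 4 pairwise checks
```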

Fig. 5. Examples of closed caption localization: (a) baseball; (b) news; (c) volleyball.

3.3 Font Size Differentiation

From Fig. 4 (d), we can notice that the scoreboard in the left upper corner and the trademark in the right upper corner are both successfully detected. Since scoreboards can be used for the content structuring of sports videos, the issue of separating out the captions in the scoreboard is one of our concerns. Hence, the tool font size detector is proposed to automatically discriminate the font size as a support in the discrimination of scoreboards. To detect the font size, the gradient energy of each text block is exploited. Since a block consisting of characters will have much larger gradient energy than a block consisting of blank space, the distance between two character blocks can be determined by evaluating the distance between peak gradient values among blocks in a row or column. It means that the font size can be evaluated by measuring the distance between blocks with peak gradient values (i.e., the periodicity of peak values). The gradient energy in the vertical direction, instead of the horizontal direction, is exploited since the blank space between two text rows is generally larger than that between two letters, and hence the variation of gradient energy in the vertical direction presents a more regular pattern.

Fig. 6. An overlap-block is interpolated from its two neighboring 8 × 8 blocks B_t and B_b.

In addition, to obtain robust periodicity, we compute the DCT coefficients of the 8 × 8 overlap-block between two neighboring blocks as defined in Eq. (8). An overlap-block $B_{overlap\text{-}block}$, shown in Fig. 6, comprises the lower portion of the top neighboring 8 × 8 block $B_t$ and the upper portion of the bottom neighboring block $B_b$, where $I_{w0}$ and $I_{w1}$ are the identity matrices of dimension $w0 \times w0$ and $w1 \times w1$, respectively. More robust results would be achieved if more overlap-blocks are computed and exploited. For example, w0 and w1 can be respectively set to 1 and 7, 2 and 6, 3 and 5, etc., to acquire more overlap-blocks for a more accurate estimation of font size.

$B_{overlap\text{-}block} = \begin{bmatrix} 0 & I_{w0} \\ 0 & 0 \end{bmatrix} B_t + \begin{bmatrix} 0 & 0 \\ I_{w1} & 0 \end{bmatrix} B_b$   (8)

Fig. 7. The proposed approach of font size differentiation in the compressed domain: a region with consistent block height is selected from the localized caption, the curve of vertical AC energy of each of the N block columns is analyzed, the average periodicity of block column i is $T_i = \frac{1}{n} \sum_{j=1}^{n} T_{i,j}$, and the font size is estimated in each block column.

Fig. 7 shows the proposed approach of font size differentiation, in which the periodicity and variance are estimated for each block column. However, localized closed captions, like the example at the top of Fig. 7, may not be complete in shape because some pieces with low gradient energy are filtered out. Therefore, to achieve robust font size differentiation, a region that forms a rectangle in the localized caption is determined for font size computation. Font size differentiation is performed on each block column in the selected region of the closed caption, where a block column, depicted in Fig. 7, is defined as a whole column of blocks. While the AC energy of each block is extracted, the curve of the variation of AC energy for each block column is checked to locate each local maximum. We can observe that the region containing the boundary of closed captions would have conspicuous texture variation in the vertical direction and the value of the gradient energy would be relatively high.
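In the spatial domain, the construction of Eq. (8) is simply a row-wise selection from the two neighboring blocks. The sketch below illustrates that selection under the assumption of 8 × 8 pixel blocks; the paper itself derives the corresponding DCT coefficients directly in the compressed domain, which the same matrix form makes possible.

```python
import numpy as np

def overlap_block(b_top, b_bottom, w0=2, w1=6):
    """Eq. (8): build the 8x8 overlap-block from the bottom w0 rows of the top
    block B_t and the top w1 rows of the bottom block B_b (w0 + w1 = 8), using
    selection matrices with identity sub-blocks I_{w0} and I_{w1}.
    Several (w0, w1) pairs, e.g. (1, 7), (2, 6), (3, 5), can be used."""
    assert w0 + w1 == 8
    b_top = np.asarray(b_top, dtype=float)
    b_bottom = np.asarray(b_bottom, dtype=float)
    s_top = np.zeros((8, 8))
    s_top[:w0, 8 - w0:] = np.eye(w0)      # [[0, I_w0], [0, 0]]: bottom rows of b_top
    s_bottom = np.zeros((8, 8))
    s_bottom[w0:, :w1] = np.eye(w1)       # [[0, 0], [I_w1, 0]]: top rows of b_bottom
    return s_top @ b_top + s_bottom @ b_bottom
```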

Therefore, the local maximum of the curve of vertical AC gradient energy is regarded as the boundary of closed captions. While all local maxima are recognized, we must filter out noise and select reliable curve peaks for further verification. Due to the fact that the first and the last local maxima usually reflect the boundary of closed captions, we select the first and the last peaks of the curve and compute the average of the values of these two peaks as an adaptive threshold for noise filtering. If the value of a peak is smaller than the threshold, the peak is filtered out. Otherwise, the peak is kept for font size computation. Therefore, the periodicity $T_i$ of each block column is computed by averaging the distance between two peaks of the curve of AC energy. Finally, the average periodicity T and the periodicity variance V of the closed caption are obtained by

$T = \frac{1}{N} \sum_{i=1}^{N} T_i$   (9)

and

$V = \sum_{i=1}^{N} T_i^{2} / N - \left( \sum_{i=1}^{N} T_i / N \right)^{2}$,   (10)

where N is the total number of block columns in the selected area of the closed caption.

In order to estimate the periodicity of the font size more efficiently, we exploit the concept of the projection analysis of a print line [18, 19]. Since it can serve for the detection of blank space between successive letters, we compute the horizontal projection profile $P_H$, made up of entries $P_y$ for each block row y, by summing up the vertical AC coefficients of the blocks:

$P_H = \{ P_y \}, \quad P_y = \sum_{x=0}^{W-1} \sum_{v=1}^{7} |AC_{v,0}|,\ AC_{v,0} \in B_{x,y}, \quad 0 \le y \le H_T - 1$,   (11)

where $H_T$ is the sum of the number of original blocks (H) and the number of overlap-blocks (H - 1) of a block column in an H × W caption region, and $B_{x,y}$ is the block at coordinate (x, y). By this method, we compute the periodicity T of each localized closed caption once, instead of inspecting the periodicity T and the variance V of each block column. The horizontal projection profile of the scoreboard and the trademark is demonstrated in Fig. 8, where the average periodicity T of the scoreboard and the trademark is about 2 and 3, respectively. Using the horizontal projection profile, font size can be detected more efficiently since only one curve of AC energy variation needs to be computed for a closed caption.

Fig. 8. Horizontal projection profile of DCT AC energy of the scoreboard and the trademark.
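A sketch of the projection-profile variant (Eq. (11)) and the periodicity estimate of Eq. (9) is given below. The vertical AC energy of every block (original and overlap) in the selected caption region is assumed to be available as a 2-D array indexed by (block row, block column); peak selection and the adaptive noise threshold follow the description above.

```python
import numpy as np

def horizontal_projection_profile(vertical_ac_energy):
    """Eq. (11): P_y is the sum over the block columns x of the vertical AC
    energy of block B_{x,y}; the profile is the vector of all P_y."""
    return np.asarray(vertical_ac_energy, dtype=float).sum(axis=1)

def average_periodicity(profile):
    """Estimate the font-size periodicity from the profile: locate the local
    maxima, filter them with the adaptive threshold (mean of the first and
    last peak values), then average the distance between surviving
    consecutive peaks (Eq. (9))."""
    p = np.asarray(profile, dtype=float)
    peaks = [y for y in range(1, len(p) - 1) if p[y] >= p[y - 1] and p[y] >= p[y + 1]]
    if len(peaks) < 2:
        return None
    threshold = (p[peaks[0]] + p[peaks[-1]]) / 2.0   # adaptive noise threshold
    peaks = [y for y in peaks if p[y] >= threshold]
    if len(peaks) < 2:
        return None
    return float(np.mean(np.diff(peaks)))            # average peak-to-peak distance
```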

4. EXPERIMENTAL RESULTS AND VISUALIZATION SYSTEM

4.1 Experimental Results

In the experiment, the testing dataset consisted of four kinds of videos, including tennis, baseball, volleyball and news. Two tennis videos, selected from the US Open and the Australia Open respectively, were recorded from the Star-Sport TV channel. A volleyball video was recorded from the ESPN TV channel and a baseball game was recorded from the VL-Sport TV channel. A news video was selected from the MPEG-7 testing dataset. The testing sequences were encoded in MPEG-2 format with the GOP structure IBBPBBPBBPBBPBB at 30 fps. The length of the first tennis video and the news video was about 50 minutes, and the length of the second tennis video was about 30 minutes. The length of the volleyball video and the baseball video was about 40 minutes and 60 minutes, respectively. The ground truth of the number of caption I-frames of the tennis videos shown in Table 1 was 903 and 414, respectively. In Table 2, there were in total 42183 text blocks in the representative frames of tennis video 1 and in total 25680 text blocks in tennis video 2. The number of text blocks in baseball was larger than in the other videos due to the large superimposed captions.

The results of caption frame detection and closed caption localization were evaluated by estimating the precision and recall. The experimental result of caption frame detection is shown in Table 1, and the best performance was achieved in the first tennis video. In tennis video 1, the recall was up to 100% and the precision was about 97%. There were 26 frames of false detection due to the fact that the scoreboard was not presented but some high-texture billboards appeared with significant camera movement. In this case, we would detect large variation in the region where billboards were presented. In tennis video 2, the precision of caption frame detection was up to 98% and the recall was about 93%. The number of frames of miss detection was 31 because of the low intensity of the scoreboard in this video sequence. Besides, the color of the scoreboard and that of the tennis court were quite similar, and hence it would be more difficult for caption detection in the case of low contrast between closed captions and the background. The worst case in detecting caption frames was presented in the baseball video, since the background of several shot types was highly textured, such as the pitching shots and the audience shots. Therefore, when the camera moved, high-textured regions would be considered as the presence of captions. However, the recall rate in detecting caption frames in the baseball video remained more than 80%.

The results of closed caption localization are shown in Table 2. In tennis video 1, 41030 text blocks were correctly detected, 347 blocks were falsely detected and 395 text blocks were missed. The precision was about 99% and the recall was about 97%.

Table 1. Performance of caption frame detection.

            Ground truth of   Frames of correct  Frames of false  Frames of miss   Miss rate  Precision  Recall
            caption frames    detection          detection        detection
Tennis 1    903               903                26               0                0%         97%        100%
Tennis 2    414               383                8                31               7%         98%        93%
Volleyball  602               578                57               24               4%         91%        96%
Baseball    1554              1290               407              264              4%         76%        83%
News        960               873                113              87               9%         88%        91%
Average                                                                                       90%        93%

Table 2. Performance of closed caption localization after caption frames are detected.

            Ground truth of   Blocks of correct  Blocks of false  Blocks of miss   Miss rate  Precision  Recall
            text blocks       detection          detection        detection
Tennis 1    42183             41030              347              395              1%         99%        97%
Tennis 2    25680             24624              732              1056             4%         97%        95%
Volleyball  81                27353              4087             50               8%         87%        92%
Baseball    296405            269729             63270            26676            9%         81%        91%
News        201616            192016             23732            10080            5%         89%        95%
Average                                                                                       91%        94%

In tennis video 2, 24624 text blocks were correctly detected, 732 blocks were falsely detected and in total 1056 text blocks were missed. Hence, the precision and recall of tennis video 2 were 97% and 95%, respectively. Some text blocks were missed since the background of the closed caption was transparent and would change with the background while the camera moved. In this case, if the texture of the background was similar to the closed caption, the letters of the captions could not reflect a large variation in gradient energy and some text blocks would be missed. The precision rate of the baseball video in detecting text blocks was 81% due to the highly textured background. However, the recall rate was up to 92% since the temporal consistency was exploited to filter noise. Most of the blocks which appeared for a short duration and whose spatial positions were not consistent were regarded as noise and were thus eliminated. The good performance was due to the fact that the weighted horizontal-vertical AC coefficients were exploited and the long-term consistency of the closed caption over consecutive frames was considered.

By applying the proposed approach of font size differentiation, we can automatically discriminate the font size either within one closed caption or across different ones. Therefore, this designed tool can be used as a closed caption filter to recognize and select those of interest, once the user indicates the targeted font size of closed captions. Moreover, researches [22-25] focusing on video OCR indicate that a bottleneck for recognizing characters is due to the variation of text font and size. In addition, to prepare learning data for the filter of character extraction, the size of the filter, which is defined to include a line element of characters, should be determined. Since the size of the line element strongly depends on the font size, it is possible to design a filter that can enhance the line elements dynamically for widely varying font sizes when the font size of the localized captions is known. Consequently, the font size differentiation tool can be exploited as a pre-processing tool for video OCR.

4.2 The Prototype System of Video Content Visualization

With the successful localization of the superimposed scoreboard in sports videos, video content can be visualized in a compact form by constructing a hierarchical structure. Taking tennis as an example, the structured content composed of scoreboards and the related shots can be combined with the detected tennis semantic events [20], such as baseline rally, serve and volley, and passing shot. Each competition shot can be annotated using the type of the corresponding event and can be labeled exploiting the scoreboard. Consequently, the information of the type of events, the boundary of events, the key frame of events and the result of the event (the scoreboard) can be used in the Highlight Level Description Scheme shown in Fig. 9 to support users in efficiently browsing videos by viewing the images of scoreboards and the important text information of semantic events. The name of the highlight corresponds to the type of tennis event, the descriptor of the video segment locator is described by the event boundary, and the position of the key frame in the video sequence is used for the key frame locator. The key image locator for the scoreboard indicates the time point in the video sequence.

Fig. 9. Hierarchical summary description scheme [21]: a HierarchicalSummary contains HighlightLevels, each composed of HighlightSegments with a VideoSegmentLocator, KeyFrameLocators and a KeyImageLocator for the scoreboard.

The table of video content is composed of the original video sequence at the top level, the scoreboard of a set, the scoreboard of a game and the key frame of one point.

The user interface of the prototype system is shown in Fig. 10; two areas, Playback and Visualization, are present on the left and the right side, respectively. Initially, the key frame of the original video sequence and the scoreboards of the sets are exhibited. When users click the symbol, as the arrow lines indicate, the system shows the scoreboards of the corresponding games. Users can select which game they want to watch according to the scoreboards of the games and click the symbol for more detail. Each point of the game is represented by its key frame. Users can view the point by clicking the corresponding key frame, and the shot of the point is then displayed in the Playback area. By exploiting the system of video content visualization, users can efficiently browse video sequences. Since the length of a sports video is generally up to one or two hours, the system provides a compact and brief overall view of the match for users by exhibiting the textual information of the scoreboards hierarchically.

Fig. 10. The video content visualization system is composed of two areas, Playback and Visualization. The hierarchical structure of the scoreboards is shown when the user clicks the symbol.

5. CONCLUSION

In this paper, we have proposed a novel mechanism to detect temporal boundaries, identify meaningful shots and then build a compact table of video content. GOP-based video segmentation was used to efficiently segment videos into shots. To efficiently detect closed captions, color-based shot identification was proposed to identify shots of interest, especially for sports videos. Caption frames were detected in the shots of interest using the compressed data in MPEG videos.

Then caption frames, instead of every frame, were selected as targets for detecting closed captions based on the long-term consistency without size constraint. While closed captions were localized, we differentiate the font size of closed captions based on the horizontal projection profile of AC gradient energy obtained from both the original blocks and the interpolated sub-blocks. The proposed tool, the font size detector, can thus be used as a prefilter to effectively eliminate uninteresting closed captions and avoid most of the extremely time-consuming post-processing of localized captions. Finally, having the proposed mechanism of high-level video structuring, one can browse videos in an efficient way through a compact table of content.

REFERENCES

1. H. Wang and S. F. Chang, A highly efficient system for automatic face region detection in MPEG video, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, 1997, pp. 615-628.
2. Y. Zhong, H. Zhang, and A. K. Jain, Automatic caption localization in compressed video, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, 2000, pp. 385-392.
3. H. Luo and A. Eleftheriadis, On face detection in the compressed domain, in Proceedings of the ACM Multimedia, 2000, pp. 285-294.
4. Y. Zhang and T. S. Chua, Detection of text captions in compressed domain video, in Proceedings of the ACM Multimedia Workshop, 2000, pp. 201-204.
5. S. W. Lee, Y. M. Kim, and S. W. Choi, Fast scene change detection using direct feature extraction from MPEG compressed videos, IEEE Transactions on Multimedia, Vol. 2, 2000, pp. 240-254.
6. X. Chen and H. Zhang, Text area detection from video frames, in Proceedings of the 2nd IEEE Pacific Rim Conference on Multimedia, 2001, pp. 222-228.
7. J. Nang, O. Kwon, and S. Hong, Caption processing for MPEG video in MC-DCT compressed domain, in Proceedings of the ACM Multimedia Workshop, 2000, pp. 11-14.
8. S. Y. Lee, J. L. Lian, and D. Y. Chen, Video summary and browsing based on story-unit for video-on-demand service, in Proceedings of the International Conference on Information and Communications Security, 2001.
9. J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard, Chapman and Hall, New York, 1997.
10. J. Meng, Y. Juan, and S. F. Chang, Scene change detection in a MPEG compressed video sequence, in Proceedings of the IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, Vol. 2419, 1995, pp. 14-25.
11. H. J. Zhang, C. Y. Low, S. W. Smoliar, and J. H. Wu, Video parsing and browsing using compressed data, Multimedia Tools and Applications, Vol. 1, 1995, pp. 89-111.
12. H. Li, D. Doermann, and O. Kia, Automatic text detection and tracking in digital video, IEEE Transactions on Image Processing, Vol. 9, 2000, pp. 147-156.
13. J. C. Shim, C. Dorai, and R. Bolle, Automatic text extraction from video for content-based annotation and retrieval, in Proceedings of the 14th International Conference on Pattern Recognition, 1998, pp. 618-620.

14. J. Ohya, A. Shio, and S. Akamatsu, Recognizing characters in scene images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, 1994, pp. 214-220.
15. U. Gargi, S. Antani, and R. Kasturi, Indexing text events in digital video databases, in Proceedings of the 14th International Conference on Pattern Recognition, 1998, pp. 916-918.
16. S. Kannangara, E. Asbun, R. X. Browning, and E. J. Delp, The use of nonlinear filtering in automatic video title capture, in Proceedings of the IEEE/EURASIP Workshop on Nonlinear Signal and Image Processing, 1997.
17. V. Wu, R. Manmatha, and E. M. Riseman, TextFinder: an automatic system to detect and recognize text in images, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, 1999, pp. 1224-1229.
18. S. W. Lee and D. S. Ryu, Parameter-free geometric document layout analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, 2001, pp. 1240-1256.
19. R. G. Casey and E. Lecolinet, A survey of methods and strategies in character segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, 1996, pp. 690-706.
20. D. Y. Chen and S. Y. Lee, Motion-based semantic event detection for video content descriptions in MPEG-7, in Proceedings of the 2nd IEEE Pacific Rim Conference on Multimedia, 2001, pp. 110-117.
21. ISO/IEC JTC1/SC29/WG11/N3964, MPEG-7 multimedia description schemes XM (v7.0), Singapore, March 2001.
22. W. Qi, L. Gu, H. Jiang, X. R. Chen, and H. J. Zhang, Integrating visual, audio and text analysis for news video, in Proceedings of the International Conference on Image Processing, Vol. 3, 2000, pp. 50-53.
23. D. Chen, K. Shearer, and H. Bourlard, Text enhancement with asymmetric filter for video OCR, in Proceedings of the 11th International Conference on Image Analysis and Processing, 2001, pp. 192-197.
24. T. Sato, T. Kanade, E. K. Hughes, and M. A. Smith, Video OCR for digital news archive, in Proceedings of the IEEE International Workshop on Content-Based Access of Image and Video Database, 1998, pp. 52-60.
25. Y. Ariki and K. Matsuura, Automatic classification of TV news articles based on telop character recognition, in Proceedings of the IEEE International Conference on Multimedia Systems, 1999, pp. 148-152.
26. D. Y. Chen, S. J. Lin, and S. Y. Lee, Motion activity based shot identification and closed caption detection for video structuring, in Proceedings of the 5th International Conference on Visual Information Systems, 2002, pp. 288-301.
27. A. K. Jain and B. Yu, Automatic text location in images and video frames, Pattern Recognition, Vol. 31, 1998, pp. 2055-2076.
28. S. W. Lee, D. J. Lee, and H. S. Park, A new methodology for gray-scale character segmentation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, 1996, pp. 1045-1050.

Duan-Yu Chen (陳敦裕) received the B.S. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, in 1996, the M.S. degree in Computer Science from National Sun Yat-sen University, Taiwan, in 1998, and the Ph.D. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, in 2004. His research interests include computer vision, video signal processing, content-based video indexing and retrieval, and multimedia information systems.

Ming-Ho Hsiao (蕭銘和) received the B.S. degree in Computer Science and Information Engineering from Fu Jen Catholic University, Taiwan, in 2000. He received the M.S. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, where he is currently pursuing the Ph.D. degree. His research interests are content-based indexing and retrieval and distributed multimedia systems, in particular, media server architecture and peer-to-peer systems.

Suh-Yin Lee (李素瑛) received the B.S. degree in Electrical Engineering from National Chiao Tung University, Taiwan, in 1972, the M.S. degree in Computer Science from the University of Washington, U.S.A., in 1975, and the Ph.D. degree in Computer Science from the Institute of Electronics, National Chiao Tung University. Her research interests include content-based indexing and retrieval, distributed multimedia information systems, mobile computing, and data mining.