Fast and Robust Moving Object Segmentation Technique for MPEG-4 Object-based Coding and Functionality

Ju Guo, Jongwon Kim and C.-C. Jay Kuo
Integrated Media Systems Center and Department of Electrical Engineering-Systems
University of Southern California, Los Angeles, CA

ABSTRACT

Video object segmentation is an important component for object-based video coding schemes such as MPEG-4. A fast and robust video segmentation technique, which aims at efficient foreground and background separation via an effective combination of motion and color information, is proposed in this work. First, a non-parametric gradient-based iterative color clustering algorithm, called the mean shift algorithm, is employed to provide robust dominant color regions according to color similarity. With the dominant color information from previous frames as the initial guess for the next frame, the computational time can be reduced by 50%. Next, moving regions are identified by a motion detection method, which is developed based on the frame intensity difference to circumvent the motion estimation complexity for the whole frame. Only moving regions are further merged or split according to a region-based affine motion model. Furthermore, sizes, colors, and motion information of homogeneous regions are tracked to increase temporal and spatial consistency of extracted objects. The proposed system is evaluated for several typical MPEG-4 test sequences. It provides very consistent and accurate object boundaries throughout the entire test sequences.

Keywords: video segmentation, color segmentation, mean shift algorithm, affine motion, spatial segmentation, motion detection.

1 INTRODUCTION

An object- or content-based coding scheme plays a significant role in the success of the second generation video coding. Object-based coding has the potential to provide a more accurate video representation at very low bit rates. It also allows content-based functionalities such as object manipulation.
In the recent development of the MPEG-4 standard, video coding is handled by the object unit, namely, the video object plane (VOP). A VOP represents one snapshot of an object in video. For each VOP, the motion, texture, and shape information is coded in separate bit streams. This allows separate modification and manipulation of each VOP and supports content-based functionality. Thus, video segmentation, which aims at the exact separation of moving objects from the background, becomes the foundation of content-based video coding, among many other interesting applications. Even though the image and video segmentation problem has been studied for more than thirty years, it is still considered one of the most challenging image processing tasks, and demands creative solutions for a major breakthrough. Most existing video segmentation algorithms attempt to exploit the temporal and spatial coherence information inherent in the image sequence to achieve foreground/background separation. Temporal segmentation can identify moving objects, since most moving objects have motion patterns distinct from the background. Spatial segmentation can determine object boundaries accurately if the underlying objects have a different visual appearance (such as the color or the gray level intensity) from the background. An efficient combination of spatial-temporal segmentation
modules can lead to a more promising solution to the segmentation problem. It is desirable to develop an automatic segmentation algorithm that requires no user assistance and interaction. In addition, the availability of a fast implementation is also one basic requirement, which is especially needed for real-time applications.

A fast and robust video segmentation technique is proposed in this work. It can be roughly described as follows. First, a non-parametric gradient-based iterative color clustering algorithm, called the mean shift algorithm, is employed to provide robust dominant color regions according to color similarity. With the dominant color information from previous frames as the initial guess for the next frame, the computational time can be reduced by 50%. Next, moving regions are identified by a motion detection method, which is developed based on the frame intensity difference to circumvent the motion estimation complexity for the whole frame. Only moving regions are further merged or split according to a region-based affine motion model. Furthermore, sizes, colors, and motion information of homogeneous regions are tracked to increase temporal and spatial consistency of extracted objects.

The paper is organized as follows. We first review previous segmentation work in Section 2. A general description of the proposed segmentation algorithm is given in Section 3. Video segmentation results for MPEG-4 test video are presented in Section 4. They are compared with results of the three algorithms recommended in MPEG-4. Concluding remarks are given in Section 5.

2 REVIEW OF PREVIOUS WORK

2.1 Three MPEG-4 Algorithms

Up to now, three algorithms for automatic video segmentation have been proposed in the MPEG-4 visual standard. They are temporal segmentation from Fondazione Ugo Bordoni (FUB), temporal segmentation from the University of Hannover (UH), and spatial-temporal segmentation from the Electronics and Telecommunications Research Institute (ETRI).
All of these algorithms classify pixels in an image sequence into two classes, i.e. moving objects (foreground) and background. In the algorithm proposed by ETRI, images are first simplified by morphological filters. These filters remove regions that are smaller than a given size but preserve the contours of remaining objects. The morphological gradient operator is used to estimate the luminance gradient. The region boundary is obtained by the watershed algorithm, where the similarity measure is obtained from the combination of the luminance gradient and the motion field. Finally, similar regions are merged based on graph theory. The algorithm of UH uses two successive frame differences to obtain a change detection mask. The uncovered background is removed by a hierarchical block matcher. The region boundaries of the change detection mask are adapted to luminance edges to improve the segmentation accuracy with respect to object boundaries. For the algorithm from FUB, a group of frames is first selected, and the differences of each frame in the group with respect to the first frame are evaluated. A robust fourth-order statistic test of frame differences is performed to detect the changed areas. Motion is estimated to remove the uncovered background area. Morphological open and close operators are used to refine region boundaries.

In the automatic video segmentation framework, statistical change detection and motion estimation are used in the temporal domain, while the luminance-based morphological operation and the watershed algorithm can be used to segment objects within an image in the spatial domain. Due to the complexity of the segmentation modules, these algorithms are not suitable for real-time implementation. Also, the segmentation result is still unsatisfactory for several typical videos, where the shape of the object cannot be precisely defined.
2.2 Integrated Spatial-Temporal Approach

The integration of temporal and spatial segmentation results can improve the performance at the price of an increased complexity. Bouthemy and Francois proposed a technique to simultaneously estimate the spatial-temporal segmentation and motion by adopting a Markov random field (MRF) model and Bayesian estimation. Since the MRF model is constructed in terms of local constraints on the luminance intensity and motion, the spatial information and motion can be taken into account simultaneously, and motion estimation and segmentation can be optimized jointly. However, the disadvantage of this technique is its high complexity. The spatial and temporal segmentation steps can be iteratively performed to reduce the complexity. Next, since the human visual system (HVS) is very sensitive to the edge and contour information, exact extraction of object boundaries is crucial for
visual quality of segmented results. More visual information should be used to make spatial segmentation robust and consistent.

2.3 Motivation and Summary of the Proposed Approach

Among many visual cues, the color information has not yet been fully exploited in video segmentation, since it is often perceived that human eyes are not very sensitive to the chrominance components (e.g. the UV data in the YUV-format video), and the contribution from the color information is treated as a second-order effect. Furthermore, additional computational complexity is required for color processing. We believe that the color information does play an important role in object identification and recognition in the human visual system (HVS), and it is worthwhile to include this information in the computation. Zhong and Chang applied color segmentation to separate images into homogeneous regions, and tracked them along time for content-based video query. A simple uniform quantization in the L*u*v* color space was used in their work. Kanai used uniform quantization in the HSV color space for image segmentation. The uniform color quantization was adopted in both works to reduce the complexity of segmentation.

In this work, we focus on automatic video segmentation by proposing a fast and adaptive algorithm with a reduced complexity in both the spatial and temporal domains. A fast yet robust adaptive color segmentation based on the mean shift color clustering algorithm is applied in the spatial domain. The mean shift algorithm was generalized by Cheng for clustering data, and used by Comaniciu and Meer for color segmentation. For the k-means clustering method, it is difficult to choose the initial number of classes. By using the mean shift algorithm, the number of dominant colors can be determined automatically. Here, we develop a non-parametric gradient-based algorithm that provides a simple iterative method to determine the local density maximum.
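Such an iterative mode search can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: the function name `dominant_colors`, the fixed window radius, and the deterministic start point are assumptions made for the sketch.

```python
import numpy as np

def dominant_colors(pixels, radius, max_colors=8, tol=1e-3):
    """Mean shift sketch: move a spherical search window in color space
    toward the local density maximum, record the converged mode as a
    dominant color, remove the covered samples, and repeat.

    pixels : (N, 3) array of color samples (e.g. YUV triples)
    radius : search window radius (the text sets it proportional to
             the trace of the global covariance matrix)
    """
    pts = np.asarray(pixels, dtype=float)
    modes = []
    while len(pts) and len(modes) < max_colors:
        center = pts[0]                     # start from any remaining sample
        while True:
            inside = pts[np.linalg.norm(pts - center, axis=1) < radius]
            shifted = inside.mean(axis=0)   # window mean = shifted center
            if np.linalg.norm(shifted - center) < tol:
                break                       # converged to a density mode
            center = shifted
        modes.append(center)
        # Remove all colors inside the converged search window.
        pts = pts[np.linalg.norm(pts - center, axis=1) >= radius]
    return np.array(modes)
```

Because the window mean moves in the direction of increasing sample density, each inner loop climbs to a local density maximum, which is what makes the number of dominant colors come out automatically.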
The number of color classes in the current frame can be used as the initial guess of color classes for the next frame. This helps in reducing the computational complexity of color segmentation. For the temporal domain, a noise-robust higher-order statistic motion detection algorithm and a color-region-based affine motion model are employed. After separating an image frame into homogeneous spatial regions, we determine whether each region belongs to the background or the foreground by motion detection. Only moving regions are further merged or split using the region-based affine motion model. The six parameters of the affine motion model are estimated for each region. Regions with similar motion parameters are merged. Regions that do not fit the affine motion model well are split. The size, color, and motion information of each region is tracked to increase the consistency of extracted objects. The system is applied to segment several MPEG-4 test video clips. We have observed accurate object boundaries and the temporal and spatial consistency from experimental results.

3 PROPOSED VIDEO SEGMENTATION ALGORITHM

The block diagram of the proposed automatic video segmentation algorithm is given in Fig. 1. It consists of four stages. At the first stage, the global motion compensation (GMC) procedure is performed. That is, the global motion of image sequences is estimated using the six-parameter affine model. With this information, images can be aligned accordingly. At the second stage, the mean shift color segmentation algorithm is used to partition an image into homogeneous regions. This is basically a spatial domain segmentation approach. At the third stage, we attempt to use temporal information for segmentation. A statistical motion detection approach is used to determine whether each homogeneous region is moving or not. Only for moving regions, we apply the affine motion model for motion parameter estimation.
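The six-parameter affine fit used both for global motion and for per-region motion can be sketched as a least-squares problem over block motion vectors. This is an illustrative sketch under assumed inputs (block centers and their measured motion vectors); the function name and array layout are not from the paper.

```python
import numpy as np

def fit_affine_motion(centers, vectors):
    """Least-squares fit of the six-parameter affine motion model
        u = a1 + a2*x + a3*y,   v = a4 + a5*x + a6*y
    to block motion vectors.  `centers` is an (N, 2) array of block
    positions (x, y); `vectors` is an (N, 2) array of measured motion
    vectors (u, v).  Returns the parameter vector (a1, ..., a6)."""
    centers = np.asarray(centers, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    x, y = centers[:, 0], centers[:, 1]
    A = np.column_stack([np.ones_like(x), x, y])
    # The horizontal and vertical components decouple into two linear fits.
    a123, *_ = np.linalg.lstsq(A, vectors[:, 0], rcond=None)
    a456, *_ = np.linalg.lstsq(A, vectors[:, 1], rcond=None)
    return np.concatenate([a123, a456])
```

In practice, vectors with a large matching residual would be rejected before the fit, as the text describes for the global motion stage.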
Regions can be merged and split according to the consistency of their motion parameters. The size, color, and motion data of the final segmented regions are tracked. At the last stage, morphological open and close filters are used to smooth object boundaries and eliminate small regions. These building blocks of the proposed algorithm are detailed below.

3.1 Global Motion Compensation

A six-parameter affine motion model is used to estimate the global motion, which is due to camera movement such as panning, zooming, and rotation. Motion vectors are estimated based on block matching of pixel blocks. Motion vectors with a large matching residual are rejected. The six parameters of the affine motion model are
estimated using the macroblock-based motion vectors. Once the global motion is detected and estimated, multiple frames are aligned accordingly to reduce the effect of the camera movement.

Figure 1: The block diagram of the proposed automatic video segmentation algorithm.

3.2 Color Segmentation

The intensity distribution of each color component can be viewed as a probability density function. The mean shift vector is the difference between the mean of the probability function on a local area and the center of this region. In mathematical terms, the mean shift vector associated with a region S_{\vec{x}} centered on \vec{x} can be written as

    \vec{V}(\vec{x}) = \frac{\int_{\vec{y} \in S_{\vec{x}}} p(\vec{y}) (\vec{y} - \vec{x}) \, d\vec{y}}{\int_{\vec{y} \in S_{\vec{x}}} p(\vec{y}) \, d\vec{y}},

where p(\cdot) is the probability density function. The mean shift algorithm states that the mean shift vector is proportional to the gradient of the probability density \nabla p(\vec{x}) and inversely proportional to the probability density p(\vec{x}), i.e.

    \vec{V}(\vec{x}) = c \, \frac{\nabla p(\vec{x})}{p(\vec{x})},

where c is a constant. Since the mean shift vector is along the direction toward the probability density maximum, we can exploit this property to find the actual location of the density maximum. In implementing the mean shift algorithm, the size of the search window can be made adaptive to an image by setting the radius proportional to
the trace of the global covariance matrix of the given image. By moving search windows in the color space using the mean shift vector iteratively, one dominant color can be located. After removing all colors inside the converged search window, one can run the mean shift algorithm again to locate the second dominant color. This process can be repeated several times to identify a few major dominant colors.

The uniform color space L*u*v* was used by Comaniciu et al. for color segmentation due to its perceptual homogeneity. To reduce the computational complexity, we use the YUV space for color segmentation, since original video data are stored in the YUV format. The obtained results are comparable with those based on the L*u*v* space. Dominant colors of the current frame are used as the initial guess of dominant colors in the next frame. Due to the similarity of adjacent frames, the mean shift algorithm often converges in one or two iterations, thus reducing the computational time significantly.

Color segmentation also uses the spatial relation of pixels as a constraint, as described below. For each frame, dominant colors are first generated by the mean shift algorithm. Then, all pixels are classified according to their distance to the dominant colors. A relatively small distance is used as a threshold to determine which class a pixel belongs to in the beginning. Afterwards, the threshold is doubled. Only a pixel that has a smaller distance to the dominant color and has one of its neighboring pixels assigned to the same class can be assigned to this class. Finally, each remaining unassigned pixel is assigned to its nearest neighboring region.

3.3 Motion Detection and Estimation

A robust motion detection method based on the frame difference calculation is used to determine whether homogeneous regions are moving or not.
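The change test developed in this section relies on a higher-order statistic of the inter-frame difference over a small neighborhood. A minimal sketch of a per-pixel fourth-order moment map over a 3x3 window, as used in the text, might look as follows; the function name and the plain double loop are illustrative, not the paper's implementation.

```python
import numpy as np

def fourth_moment_map(frame_a, frame_b):
    """Per-pixel fourth-order moment of the inter-frame difference over
    a 3x3 window.  Large values flag changed pixels, since the
    difference in static areas stays close to Gaussian noise."""
    d = frame_a.astype(float) - frame_b.astype(float)
    h, w = d.shape
    m4 = np.zeros_like(d)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            win = d[i - 1:i + 2, j - 1:j + 2]
            mu = win.mean()                      # sample mean over the window
            m4[i, j] = ((win - mu) ** 4).mean()  # fourth central moment
    return m4
```

Thresholding this map separates changed pixels from the still background; a production version would vectorize the window sums rather than loop per pixel.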
Since the statistical behavior of inter-frame differences produced by object movement strongly deviates from the Gaussian model, a fourth-order statistic adaptive detection of the non-Gaussian signal is performed. For each pixel at (x, y), its fourth-order moment \hat{m}_d(x, y) is evaluated as

    \hat{m}_d(x, y) = \frac{1}{9} \sum_{(s,t) \in W(x,y)} \left( d(s,t) - \hat{\mu}(x,y) \right)^4,

where d(x, y) is the inter-frame difference, W(x, y) is a 3x3 window centered at (x, y), and \hat{\mu}(x, y) is the sample mean of d(x, y) inside window W(x, y), i.e.

    \hat{\mu}(x, y) = \frac{1}{9} \sum_{(s,t) \in W(x,y)} d(s,t).

Each pixel at (x, y) is determined to be associated with the still background or the changed region according to its fourth moment \hat{m}_d(x, y). The changed regions obtained from the higher-order statistic estimation include the uncovered background. The block matching algorithm is applied to the fourth-order moment maps of frame differences in order to remove the uncovered background. Pixels that have null displacements are reassigned to the background. For each homogeneous region, if 85% of its pixels are identified as moving pixels, the region is identified as moving.

Only for moving regions, the motion vector field is estimated by using hierarchical block matching inside the regions. Each motion vector is labeled with a confidence factor based on the matching mean square error. The affine motion model is described as

    u(x, y) = a_1 + a_2 x + a_3 y,
    v(x, y) = a_4 + a_5 x + a_6 y,

where u(x, y) and v(x, y) are the motion vector components in the horizontal and vertical directions, and a_1, ..., a_6 are constant parameters. The six parameters (a_1, ..., a_6) of the affine model are estimated by using motion vectors with high confidence factors. The obtained parameters are tested for the whole region by calculating the mean square error E_R defined as

    E_R = \frac{1}{N} \sum_{(x,y) \in R} \left[ (u(x,y) - \hat{u}(x,y))^2 + (v(x,y) - \hat{v}(x,y))^2 \right],
where N is the number of pixels inside region R, and \hat{u}(x, y) and \hat{v}(x, y) are the motion vectors estimated by the affine model. When the mean square error E_R for the region R is above a certain threshold, it implies that the affine motion model cannot describe the motion of that region well, so the region should be split according to the motion information. The mean square error E_R is thus reduced. Regions with similar motion and color are merged together. The motion similarity measure is defined on the six-parameter space as

    S(R_1, R_2) = \sum_{i=1}^{6} \left( a_i^{R_1} - a_i^{R_2} \right)^2,

where a_i^{R_1} and a_i^{R_2} are the affine model parameters for regions R_1 and R_2, respectively. If S(R_1, R_2) is small, the regions R_1 and R_2 are merged together.

Moving objects are projected to the next frame according to their affine motion models. The projected region boundaries are aligned to the current region boundaries by region matching. For unmatched regions, change detection is used to find moving regions. For each new moving region, we repeat the process of motion estimation, region splitting and merging. This process allows the detection of newly appeared objects in the scene.

3.4 Postprocessing

The object masks obtained from the spatial and temporal segmentation sometimes have irregularities in the boundaries, such as small gulfs or isthmi, due to temporal and spatial signal fluctuations. This gives a visually annoying appearance and also increases the shape coding cost. We use the morphological open and close operators to remove the gulfs and isthmi, and to smooth the object boundaries to increase the shape coding efficiency. A circular structuring element with a 2-pixel radius is used in the morphological open and close operations.

4 EXPERIMENTAL RESULTS

Two MPEG-4 QCIF sequences, i.e. "Akiyo" and "Mother and daughter", are used to test the proposed algorithm. For the "Akiyo" sequence, there is only a small motion activity in the head and shoulder regions.
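The mask smoothing of Section 3.4 can be sketched with plain binary morphology. The helper names and the pure-NumPy erosion/dilation below are assumptions made for the sketch (a real implementation would use an optimized morphology library); the circular structuring element with a 2-pixel radius follows the text.

```python
import numpy as np

def _dilate(mask, se):
    """Binary dilation: a pixel is set if the element hits any foreground."""
    r = se.shape[0] // 2
    padded = np.pad(mask, r, mode='constant', constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.any(padded[i:i + 2 * r + 1, j:j + 2 * r + 1] & se)
    return out

def _erode(mask, se):
    """Binary erosion: keep a pixel only if the element fits entirely inside."""
    r = se.shape[0] // 2
    padded = np.pad(mask, r, mode='constant', constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.all(padded[i:i + 2 * r + 1, j:j + 2 * r + 1][se])
    return out

def smooth_object_mask(mask, radius=2):
    """Postprocessing sketch: morphological open then close with a circular
    structuring element (2-pixel radius, as in the text) to remove small
    gulfs/isthmi and smooth the object boundary."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (xx ** 2 + yy ** 2) <= radius ** 2    # circular structuring element
    opened = _dilate(_erode(mask, disk), disk)   # opening removes specks/spurs
    return _erode(_dilate(opened, disk), disk)   # closing fills small gulfs
```

Opening removes protrusions and isolated specks smaller than the disk, while closing fills holes and gulfs smaller than the disk, which is exactly the boundary cleanup described above.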
The original 10th and 20th image frames are shown in Fig. 2(a). The results of color segmentation are given in Fig. 2(b). We can clearly see that each image is segmented into a few regions. For example, Akiyo is segmented into the hair region, the facial region, and the shoulder region. Each region has a well-aligned boundary corresponding to the real object. The motion detection algorithm identifies the moving region, which is given in Fig. 2(c). The boundary is not well detected as compared with the real object boundary when using the motion information only. By incorporating the spatial color segmentation result, the final segmentation result is much improved, as shown in Fig. 2(d). For the "Mother and daughter" sequence, there are more head and hand motion activities than in "Akiyo". The results of color segmentation are shown in Fig. 3(b) for two different frames (i.e. the 20th and 250th frames). More regions are obtained from color segmentation. All these regions are identified as belonging to either the background or the foreground. Regions such as the mother's head and shoulder and the daughter's hair, shoulder, and face have contours which correspond to real objects. These objects, identified by motion detection and defined by color regions, were accurately segmented from the background, as given in Fig. 3(d).

Although many segmentation algorithms have been proposed, it is still a very difficult problem to evaluate the quality of the generated video objects. In MPEG-4, only subjective evaluation by tape viewing was adopted to decide the quality of segmentation results. It is desirable to use an objective measure by comparing the segmented object with the reference object. Two criteria, i.e. the spatial accuracy and the temporal coherency of the video object, are important measures for the quality of a certain algorithm. Recently, Wollborn and Mech proposed a simple pixel-based quality measure.
The spatial distortion of an estimated binary video object mask at frame t is defined as

    d(A_t^{est}, A_t^{ref}) = \frac{\sum_{(x,y)} A_t^{est}(x,y) \oplus A_t^{ref}(x,y)}{\sum_{(x,y)} A_t^{ref}(x,y)},
Figure 2: The segmentation results of the "Akiyo" QCIF sequence with respect to the 10th and the 20th frames: (a) the original images, (b) the color segmentation results, (c) the motion detection results, and (d) the final results.
Figure 3: The segmentation results of the "Mother and daughter" QCIF sequence with respect to the 20th and the 250th frames: (a) the original images, (b) the color segmentation results, (c) the motion detection results, and (d) the final results.
where A_t^{ref} and A_t^{est} are the reference binary object mask and the estimated one at frame t, respectively, and \oplus is the binary "xor" operation. The temporal coherency is measured by

    \tau(t) = d(A_t, A_{t-1}),

where A_t and A_{t-1} are the binary masks at frames t and t-1, respectively. The temporal coherency \tau_{est}(t) of the estimated binary mask A_t^{est} should be compared to the temporal coherency \tau_{ref}(t) of the reference mask. Any significant deviation from the reference indicates a bad temporal coherency.

The segmentation results of this paper are evaluated using both criteria. The results for the "Akiyo" QCIF sequence are shown in Fig. 4(a) and Fig. 4(b). For the reference mask, the hand-segmented mask from the MPEG-4 test material distribution is utilized. In Fig. 4(a), the dotted line is obtained by using higher-order statistic motion detection only and the solid line is the proposed scheme. We can see that the spatial accuracy is much improved by using the color segmentation algorithm. The error is less than 2% in most frames. In Fig. 4(b), the solid line denotes the reference mask, the dotted line the proposed scheme, and the dashed line the motion detection using the higher-order statistic method only. The temporal coherency curve also closely follows that of the reference mask. Since the reference object mask for the "Mother and Daughter" QCIF sequence is not available, only the temporal coherency is evaluated and plotted in Fig. 5. With the proposed color segmentation, the curve of temporal coherency is much smoother than the one with only motion detection, which demonstrates a better performance.

Figure 4: The objective evaluation of the "Akiyo" QCIF sequence object mask for (a) spatial accuracy and (b) temporal coherency.

We can see from these results that temporal segmentation can identify moving regions while spatial segmentation provides the important information of object boundaries.
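The two objective measures above can be sketched directly from their definitions. The function names are ours, not from Wollborn and Mech; the formulas follow the text.

```python
import numpy as np

def spatial_distortion(est_mask, ref_mask):
    """Pixel-based quality measure: the XOR area between the estimated
    and reference binary masks, normalized by the reference mask area."""
    est_mask = np.asarray(est_mask, dtype=bool)
    ref_mask = np.asarray(ref_mask, dtype=bool)
    return np.logical_xor(est_mask, ref_mask).sum() / ref_mask.sum()

def temporal_coherency(mask_t, mask_prev):
    """Temporal coherency: the same distortion measure applied to the
    masks of two successive frames."""
    return spatial_distortion(mask_t, mask_prev)
```

A perfect mask gives a distortion of 0, and the temporal coherency curve of an estimated mask sequence is judged by how closely it tracks the curve of the reference masks.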
Our algorithm exploits the spatial information of color similarity and obtains accurate region boundaries automatically. Since the human visual system is very sensitive to edge information, our segmentation results provide better visual quality than those in MPEG-4 due to the more accurate boundary information (see Fig. 6 and Fig. 7).

5 CONCLUSION

A new video segmentation algorithm for MPEG-4 object-based coding was proposed in this work. The proposed segmentation scheme leads to fast object segmentation. Also, the color segmentation combined with the region-based motion detection gives a very accurate video segmentation result. The performance of the proposed segmentation scheme was demonstrated via several experimental results.

6 REFERENCES

L. Torres and M. Kunt, Video Coding (The Second Generation Approach), Kluwer Academic Publishers, 1996.
Figure 5: The objective evaluation of the "Mother and Daughter" QCIF sequence object mask by temporal coherency, where the solid line denotes the result of the proposed method while the dotted line uses only motion detection with higher-order statistics.

Figure 6: Performance comparison of MPEG-4 automatic segmentation methods and the proposed method for the "Akiyo" QCIF sequence: (a) the MPEG-4 FUB algorithm, (b) the MPEG-4 ETRI algorithm, (c) the MPEG-4 UH algorithm, and (d) the proposed algorithm.
Figure 7: Performance comparison of MPEG-4 automatic segmentation methods and the proposed method for the "Mother and daughter" QCIF sequence: (a) the MPEG-4 FUB algorithm, (b) the MPEG-4 ETRI algorithm, (c) the MPEG-4 UH algorithm, and (d) the proposed algorithm.
"Information Technology - Coding of Audio-Visual Objects: Visual," ISO/IEC Final Committee Draft, May 1998.

J. Ohm, Ed., "Core experiments on multifunctional and advanced layered coding aspects of MPEG-4 video," Doc. ISO/IEC JTC1/SC29/WG11 N2176, May 1998.

C. Gu and M. G. Lee, "Semantic video object segmentation and tracking using mathematical morphology and perspective motion model," IEEE International Conference on Image Processing, Santa Barbara, CA, Oct. 1997.

P. Bouthemy and E. Francois, "Motion segmentation and qualitative dynamic scene analysis from an image sequence," International Journal of Computer Vision, Vol. 10, 1993.

F. Dufaux and F. Moscheni, "Spatio-temporal segmentation based on motion and static segmentation," IEEE International Conference on Image Processing, Washington, DC, Oct. 1995.

D. Zhong and S. F. Chang, "Video object model and segmentation for content-based video indexing," IEEE International Symposium on Circuits and Systems, Hong Kong, June 1997.

Y. Kanai, "Image segmentation using intensity and color information," Visual Communications and Image Processing, Jan. 1998.

Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 17, 1995.

D. Comaniciu and P. Meer, "Robust analysis of feature space: color image segmentation," IEEE Computer Vision and Pattern Recognition, San Juan, Puerto Rico, June 1997.

M. Wollborn and R. Mech, "Refined procedure for objective evaluation of video object generation algorithms," Doc. ISO/IEC JTC1/SC29/WG11 M3448, March 1998.