ROBUST FOREGROUND SEGMENTATION FOR GPU ARCHITECTURE IN AN IMMERSIVE 3D VIDEOCONFERENCING SYSTEM. Jaume Civit, Oscar Divorra Escoda


Telefonica Research, Barcelona, Spain

ABSTRACT

Current telepresence systems, while a great step forward in videoconferencing, still leave considerable room for improvement in eye contact, gaze and gesture awareness. Many-to-many communications will benefit greatly from mature autostereoscopic 3D technology, which allows people to engage in more natural remote meetings, with proper eye contact and a better sense of spatiality. For this purpose, proper real-time multi-perspective 3D video capture is necessary (often based on one or more View+Depth data sets). Given the current state of the art, some sort of foreground segmentation is often necessary at acquisition in order to generate 3D depth maps with high enough resolution and accurate object boundaries. This requires flicker-free foreground segmentation that is accurate at borders, resilient to noise and foreground shade changes, and able to operate in real time on high-performance architectures such as GPGPUs. This paper introduces a robust foreground segmentation approach used within the experimental immersive 3D telepresence system of the EU-FP7 3DPresence project. The proposed algorithm is based on cost minimization using Hierarchical Belief Propagation and on outlier reduction by regularization over over-segmented regions. The iterative nature of the approach makes it scalable in complexity, allowing it to increase accuracy and picture size capacity as GPGPUs become faster. Particular care has also been taken in the design of the foreground and background cost models in order to overcome limitations of previous work in the literature.

Index Terms: Foreground, Belief Propagation, Segmentation, GPUs, Real-time, 3D Videoconference

This work has been supported by EU FP7 Project 3DPresence, Proposal no.: FP

1. INTRODUCTION

In recent years, significant work has been carried out to push visual communications and media to the next level. With 2D visual quality and definition having reached a certain plateau of maturity, 3D seems to be the next stage in realism and visual experience. Now that a number of technologies, such as broadband Internet and high-quality, low-delay HD video compression, have become mature enough, several products have been able to enter the market, establishing a solid step towards practical telepresence solutions. Among them are the large-format videoconferencing systems from major providers such as Cisco Telepresence, HP Halo, Polycom, etc.

Fig. 1. Immersive Multi-Perspective simultaneous views of the 3DPresence System. The system is designed to show these two simultaneous perspectives, each of them also in auto-stereoscopic 3D.

However, current systems still suffer from fundamental imperfections that are known to be detrimental to the communication process. Eye contact and gaze cues are essential elements of visual communication, important for signaling attention and managing conversational flow [1, 2]. Nevertheless, current telepresence systems make it difficult for a user, mainly in many-to-many conversations, to really feel whether someone is actually looking at him/her (rather than at someone else), or where or at whom a given gesture is aimed. In short, body language is still poorly transmitted by today's communication systems. Many-to-many communications are expected to benefit greatly from mature auto-stereoscopic 3D technology, which allows people to engage in more natural remote meetings, with better eye contact and a better sense of spatiality.
Indeed, 3D spatiality, the volume of objects and people, their multi-perspective nature, and depth are very important cues that are missing in current systems. Telepresence is thus a field awaiting mature solutions for real-time free-viewpoint (or multi-perspective) 3D video (e.g. based on several View+Depth data sets). Fig. 1 shows the two simultaneous perspectives transmitted by the 3DPresence system. Fig. 2 depicts the picture format used to feed our auto-stereoscopic screens: both view perspectives plus their respective depth maps.

Fig. 2. Formatted information for the Multi-Perspective 3D Screen with 2 perspective texture views, plus their respective depth-maps. Depth-maps require foreground segmentation in order to best define contours of the salient silhouette/s with respect to plain backgrounds (as well as to lower computational load in depth computation).

Given the current state of the art, accurate and high-quality 3D depth generation in real time is still a difficult task. Some sort of foreground segmentation is often necessary at acquisition in order to generate 3D depth maps with high enough resolution and accurate object boundaries. This requires flicker-free foreground segmentation that is accurate at borders, resilient to noise and foreground shade changes, and able to operate in real time on high-performance architectures such as GPGPUs.

Foreground segmentation has been studied from a range of points of view [3, 4, 5, 6, 7], each with advantages and disadvantages concerning robustness and the ability to fit within a GPGPU. Local, pixel-based, threshold-based classification models [3, 4] map very easily onto GPU architectures and can exploit their parallelism, but they lack robustness to noise and shadows. More elaborate approaches that include morphological post-processing [5] are more robust, but they may have a hard time exploiting GPUs because of their sequential processing nature. They also make strong assumptions about object structure, which leads to wrong segmentation when the foreground object includes closed holes. More global approaches such as [6] can be a better fit. However, the statistical framework proposed there is too simple and leads to temporal instabilities in the segmented result. Finally, very elaborate segmentation models that include temporal tracking [7] may simply be too complex to fit into real-time systems.

This paper introduces a robust, real-time foreground segmentation approach used within the experimental immersive 3D telepresence system of the EU-FP7 3DPresence project [8, 1]. The proposed algorithm is based on the cost minimization of a set of probability models (i.e. foreground, background and shadow) by means of Hierarchical Belief Propagation. The approach includes outlier reduction by regularization over over-segmented regions, which ensures that the Belief Propagation step is initialized with low-noise segmentation class costs. The optimization stage is able to close holes and minimize the remaining false positives and negatives. The use of a k-means over-segmentation framework that enforces temporal correlation of the color centroids helps ensure temporal stability between frames. Particular care has also been taken in the re-design of the foreground and background cost models in order to overcome limitations of previous work in the literature. The iterative nature of the approach makes it scalable in complexity, allowing it to increase accuracy and picture size capacity as commercial GPGPUs become faster. The results are robust and meet the robustness/complexity needs of our immersive 3D telepresence system.

The remainder of the paper is structured as follows. Section 2 describes the problem formulation as an energy function minimization. Section 3 explains in further detail the proposed implementation, the different computation stages involved and the aspects specific to the GPU. Results are evaluated in Section 4. Finally, conclusions are drawn in Section 5.

2. GENERAL PROBLEM STATEMENT AND MODELS FORMULATION

2.1. Segmentation Problem Formulation

In this work, the segmentation process is posed as a cost minimization problem. For a given pixel, a set of costs is derived from its probabilities of belonging to the foreground (FG), background (BG) or shadow (SH) classes. Each pixel is assigned the label that has the lowest associated cost:

$$\mathrm{PixelLabel}(\vec{C}) = \arg\min_{\alpha \in \{BG, FG, SH\}} \mathrm{Cost}_{\alpha}(\vec{C}). \qquad (1)$$

In order to compute these costs, a number of steps are taken so that they are as free of noise and outliers as possible. In this work, this is done by computing costs region-wise over temporally consistent, homogeneous color areas, followed by a robust optimization procedure. In order to discriminate well among background, foreground and shadow, special care has been taken in redesigning the cost models, as explained in the following.

2.2. Foreground, Background and Shadow Models

In order to define the set of cost functions corresponding to the three segmentation classes, we build upon [6]. In our case, however, the Background and Shadow models are redefined in order to make them more accurate and to reduce temporal instability in the classification phase. For this, we revisit [4] and derive equivalent background and shadow probability models based on chromatic distortion (3), color distance and brightness (2) measures. Unlike in [4], though, where the models were fully defined to work with a threshold-based classifier, we reformulate them here from a Bayesian point of view.
This is performed such that additive costs can be derived after applying the logarithm to the probability expressions found. Thanks to this, models can then be used within the optimization framework chosen for this work. As a reminder, brightness and color distortion (with respect to the trained background model) are defined as follows. First, brightness (BD) is such that

$$BD(\vec{C}) = \frac{C_r\,C_{r_m} + C_g\,C_{g_m} + C_b\,C_{b_m}}{C_{r_m}^2 + C_{g_m}^2 + C_{b_m}^2}, \qquad (2)$$

where $\vec{C} = \{C_r, C_g, C_b\}$ is a pixel or segment color with rgb components, and $\vec{C}_m = \{C_{r_m}, C_{g_m}, C_{b_m}\}$ is the corresponding trained mean for the pixel or segment color in the background. The chroma distortion can be simply expressed as:

$$CD(\vec{C}) = \sqrt{\left(C_r - BD(\vec{C})\,C_{r_m}\right)^2 + \left(C_g - BD(\vec{C})\,C_{g_m}\right)^2 + \left(C_b - BD(\vec{C})\,C_{b_m}\right)^2}. \qquad (3)$$

Based on these, we define the cost for Background as

$$\mathrm{Cost}_{BG}(\vec{C}) = \frac{\|\vec{C} - \vec{C}_m\|^2}{5\,\sigma_m^2\,K_1} + \frac{CD(\vec{C})^2}{5\,\sigma_{CD_m}^2\,K_2}, \qquad (4)$$

where $\sigma_m^2$ represents the variance of that pixel or segment in the background, and $\sigma_{CD_m}^2$ is the one corresponding to the chromatic distortion. Akin to [6], the foreground cost can be just defined as

$$\mathrm{Cost}_{FG}(\vec{C}) = \frac{K_3}{5}. \qquad (5)$$

We finally design the cost related to shadow probability as

$$\mathrm{Cost}_{SH}(\vec{C}) = \frac{CD(\vec{C})^2}{5\,\sigma_{CD_m}^2\,K_2} + \frac{K_4}{BD(\vec{C})^2} - \log\left(1 - \frac{1}{2\,\pi\,\sigma_m^2\,K_1}\right). \qquad (6)$$
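The following Python sketch illustrates how the measures (2) and (3) and the three class costs could be evaluated for a single color, and how the label of (1) is then chosen. It is only an illustration under several assumptions: the function names are hypothetical, the constant foreground cost is passed in as a parameter `fg_cost` (its value in the usage note is chosen purely for illustration), the shadow cost follows one reading of (6), the variances are made-up sample values, and a small floor on BD squared is added to avoid a division by zero on black pixels.

```python
import math


def brightness_distortion(c, cm):
    """Eq. (2): scale factor that best projects color c onto the background mean cm."""
    num = sum(ci * mi for ci, mi in zip(c, cm))
    den = sum(mi * mi for mi in cm)
    return num / den


def chroma_distortion(c, cm):
    """Eq. (3): distance between c and the brightness-scaled background mean."""
    bd = brightness_distortion(c, cm)
    return math.sqrt(sum((ci - bd * mi) ** 2 for ci, mi in zip(c, cm)))


def class_costs(c, cm, var_m, var_cd, fg_cost, k1=10.0, k2=30.0, k4=0.2):
    """Return the BG / FG / SH costs for color c against background mean cm.

    var_m and var_cd are the trained variances of the color and of the
    chroma distortion. Requires 2*pi*var_m*k1 > 1 so the logarithm is defined.
    """
    bd = brightness_distortion(c, cm)
    cd2 = chroma_distortion(c, cm) ** 2
    dist2 = sum((ci - mi) ** 2 for ci, mi in zip(c, cm))
    chroma_term = cd2 / (5.0 * var_cd * k2)
    cost_bg = dist2 / (5.0 * var_m * k1) + chroma_term            # Eq. (4)
    bd2 = max(bd * bd, 1e-6)  # assumption: floor to avoid dividing by zero
    cost_sh = (chroma_term + k4 / bd2
               - math.log(1.0 - 1.0 / (2.0 * math.pi * var_m * k1)))  # one reading of Eq. (6)
    return {"BG": cost_bg, "FG": fg_cost, "SH": cost_sh}


def pixel_label(c, cm, var_m, var_cd, fg_cost):
    """Eq. (1): pick the class with the lowest cost."""
    costs = class_costs(c, cm, var_m, var_cd, fg_cost)
    return min(costs, key=costs.get)
```

With a gray background mean `(100, 100, 100)`, `var_m = 4.0`, `var_cd = 1.0` and an illustrative `fg_cost = 3.0`, a pixel equal to the mean is labeled BG, a uniformly darker pixel `(60, 60, 60)` is labeled SH, and a strongly colored pixel `(200, 50, 50)` is labeled FG.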

In (4), (5) and (6), $K_1$, $K_2$, $K_3$ and $K_4$ are adjustable proportionality constants corresponding to each of the distances used in the costs above. These constants play the role of the adjustable thresholds in [4]. In this work, thanks to the normalization factors in the expressions, once all $K_x$ parameters are fixed, results remain largely independent of the scene and need no additional content-based tuning. The values used in our case are given along with the results in Section 4.

Fig. 3. Segmentation Algorithmic Block Architecture. (Blocks: K-Means Clustering with Temporal Constraint; Homogeneous Regions Segmentation; Estimation of Region Statistics; FG/BG/SH Segmentation Model Initialization; FG/BG/SH Transition Modelling; Hierarchical BP Optimization (pixel-wise); Region Re-Projection for Sharp Contours. Inputs: Frame(t) and the clusters of Frame(t-1), fed back through a unit delay. Output: SegmentationMask(t).)

3. IMPLEMENTATION

3.1. Overall Algorithm Description

The models of Section 2, while straightforwardly applicable pixel-wise, would not provide satisfactory results unless used within a more structured computational framework. Robust segmentation requires, at least, exploiting the spatial structure of the content in a more global manner, in addition to the local modeling of the foreground, background and shadow classes. For this purpose, we estimate pixel costs as an average over temporally stable, homogeneous color regions [9]. First, the image is over-segmented using a homogeneous color criterion (see Fig. 4) by means of a k-means approach. Furthermore, to ensure temporal stability and consistency of the homogeneous segments, a temporal correlation is enforced on the k-means color centroids. Segmentation model costs are then computed per color segment. After that, hierarchical Belief Propagation [10] is used to find the best possible global solution by optimizing costs.
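To make the region-wise regularization concrete, the sketch below averages per-pixel class costs over the over-segmented regions and gives every pixel the mean cost of its region. It is a simplified illustration, not the CUDA implementation: the function name, the flat list layout and the dictionary keys are assumptions, and the paper may instead average the underlying statistics (BD, CD) before computing the costs.

```python
from collections import defaultdict


def region_average_costs(pixel_costs, region_of):
    """Replace each pixel's costs by the mean costs of its region.

    pixel_costs: list of dicts {"BG": ..., "FG": ..., "SH": ...}, one per pixel.
    region_of:   list of region ids (e.g. k-means cluster labels), one per pixel.
    """
    sums = defaultdict(lambda: {"BG": 0.0, "FG": 0.0, "SH": 0.0})
    counts = defaultdict(int)
    for cost, region in zip(pixel_costs, region_of):
        counts[region] += 1
        for label, value in cost.items():
            sums[region][label] += value
    means = {
        region: {label: total / counts[region] for label, total in acc.items()}
        for region, acc in sums.items()
    }
    # Every pixel of a region shares the same averaged cost vector.
    return [means[region] for region in region_of]
```

A single noisy pixel inside an otherwise background-like region is thus pulled towards the region consensus before the global optimization starts, which is the outlier reduction the text refers to.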
Optionally, after Belief Propagation, the final decision can be made pixel-wise or region-wise on final averaged costs computed over uniform color regions, to further refine foreground boundaries. Fig. 3 depicts the block architecture of the algorithm.

3.2. Regularized Costs Estimation on Time-Consistent Oversegmented Regions

In order to use the image's spatial structure in a computationally affordable way, several methods have been considered, taking into account the hardware available in our system. While a large number of image segmentation techniques are available, we need to exploit the parallel architecture of the Graphics Processing Units (GPUs) available on computers nowadays. Knowing that the initial segmentation is only going to be used as a support stage for further computation, a good approach is a k-means clustering based segmentation [11].

k-means clustering is a well-known algorithm for cluster analysis used in numerous applications. Given a group of samples $(X_1, X_2, \ldots, X_n)$, where each sample is a d-dimensional real vector, in this case (R, G, B, x, y), where R, G and B are the pixel color components and x, y are its coordinates in image space, it aims to partition the n samples into k sets $S = \{S_1, S_2, \ldots, S_k\}$ such that:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{X_j \in S_i} \|X_j - \mu_i\|^2, \qquad (7)$$

where $\mu_i$ is the mean of the points in $S_i$.

Clustering is a very time-consuming process, especially for large data sets. Hence, we introduce some constraints and slight modifications to the main method, which help it fit the problem and the particular GPU architecture in use in our case (i.e. number of cores, threads per block, etc.). The common k-means algorithm proceeds by alternating between assignment and update steps:

Assignment: Assign each sample to the cluster with the closest mean.

$$S_i^{(t)} = \left\{ X_j : \|X_j - \mu_i^{(t)}\| \le \|X_j - \mu_{i^*}^{(t)}\|, \ \forall\, i^* = 1, \ldots, k \right\} \qquad (8)$$

Update: Calculate the new means to be the centroid of the cluster.

$$\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{X_j \in S_i^{(t)}} X_j \qquad (9)$$

The algorithm converges when assignments no longer change. In our implementation, the initial Assignment set $(\mu_1^{(1)}, \ldots, \mu_k^{(1)})$ is constrained to the parallel architecture of the GPU by means of a number of sets that also depends on the image size. The input is split into a grid of $n \times n$ squares, yielding $(M \times N)/n^2$ clusters, where N and M are the image dimensions. The initial Update step is computed from the pixels within these regions. With this, we help the algorithm converge in a lower number of iterations. The second constraint is introduced in the Assignment step: each pixel can only change its cluster assignment to a strictly neighboring k-means cluster, so that spatial continuity is ensured. The initial grid and the maximum number of iterations allowed strongly influence the final size and shape of the homogeneous segments. In these steps, n is related to the block size used in the execution of the process kernels within the GPU. The above constraint leads to:

$$S_i^{(t)} = \left\{ X_j : \|X_j - \mu_i^{(t)}\| \le \|X_j - \mu_{i^*}^{(t)}\|, \ \forall\, i^* \in N(i) \right\} \qquad (10)$$

where $N(i)$ is the neighborhood of cluster i. Finally, considering the strong temporal correlation from frame to frame in our video application, the centroids resulting from the k-means segmentation of a frame are used to initialize the over-segmentation of the next one. This further accelerates the convergence of the initial segmentation while also improving the temporal consistency of the final result between consecutive frames. The current approach undoubtedly produces locally optimal solutions. However, these are sufficient for our requirements before the remaining steps: region-wise average cost computation, global optimization by belief propagation, and k-means region-wise foreground/background decision. As shown in Fig. 4, the resulting regions are small but big enough to account for the image's spatial structure in the calculation.

Fig. 4. Color based over-segmentation for robust foreground segmentation initialization. Top: Full scene. Middle: Full scene over-segmented with our GPU-adapted k-means stage. Bottom: Detail of the over-segmentation on the face of the person and surroundings. On large homogeneous regions, one can appreciate the shape of the initial grid, which persists through iterations.

In terms of implementation, the whole segmentation process is developed in CUDA (NVIDIA's C extensions for their graphics cards). Each step, assignment and update, is built as a CUDA kernel for parallel processing. Each GPU thread works only on the pixels within a cluster. The resulting centroid data is stored in texture memory, avoiding memory misalignment. A CUDA kernel for the Assignment step stores the decision in a register per pixel. The Update CUDA kernel looks into the register previously stored in texture memory and computes the new centroid for each cluster. Since real-time operation is a requirement for our system, the number of iterations is limited to n, where n is the size of the initialization grid.

3.3. Making Foreground Segmentation Robust with Hierarchical Belief Propagation Optimization

After the initial geometric segmentation, the next step is the generation of the region-wise averages for chromatic distortion (CD), brightness (BD) and the other statistics required by the models of Section 2. The step after that is to find a global solution to the foreground segmentation problem. Once the image's spatial structure has been taken into account through the regularization of the estimation costs via our customized k-means clustering method, we need a global minimization algorithm that fits our real-time constraints. A well-known algorithm is the one introduced in [10], which implements a hierarchical belief propagation approach. Again, a CUDA implementation of this algorithm is used in order to maximize parallel processing within each of its iterations. Specifically, in our implementation three levels are considered in the hierarchy, with 4, 2 and 1 iterations per level (from finer to coarser resolution levels). We assign fewer iterations to the coarser layers of the pyramid in order to balance speed of convergence against resolution losses in the final result. A higher number of iterations at coarser levels makes the whole process converge faster but also compromises the accuracy of the result on small details. Finally, the result of the global optimization step is used for classification based on (1), either pixel-wise or region-wise, with a re-projection into the segmentation space in order to improve boundary accuracy.

4. RESULTS

In the following, a series of results is presented to illustrate the performance of the presented segmentation algorithm, including some results where it is used as side information for the HRM-based [12] depth estimation of the 3DPresence system. The algorithm is always used with the same parameter tuning: in our setting, the parameters have been configured once and do not require adjustment depending on the scene. The constants defined in Section 2 are set to $K_1 = 10$, $K_2 = 30$, $K_3 = 1.0$ and $K_4 = 0.2$. As discussed previously, the hierarchical Belief Propagation uses 3 levels with 1, 2 and 4 iterations per level, from coarse to fine. The homogeneous color segmentation is based on an 8-iteration k-means as described in Section 3. Finally, 200 frames have been used to train the background model statistics. In our particular application, only segments labeled as foreground are of interest. Hence, background and shadow are finally merged into one single background segment.
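For reference, the grid-initialized, neighbor-constrained k-means of Section 3 (run for a fixed number of iterations, such as the 8 quoted above) can be sketched sequentially in Python as follows. This is a minimal illustration rather than the CUDA kernels of the actual system: the function name and data layout are hypothetical, color and spatial coordinates are given equal weight, and the neighborhood N(i) is assumed to be the cluster itself plus the adjacent cells of the initial grid. Passing the centroids of the previous frame as `init_centroids` mimics the temporal initialization.

```python
def grid_kmeans(image, n, iterations, init_centroids=None):
    """Over-segment `image` (H x W list of (r, g, b) tuples) into grid-seeded clusters.

    n is the side of the initial grid squares. Returns (labels, centroids).
    """
    h, w = len(image), len(image[0])
    gh, gw = (h + n - 1) // n, (w + n - 1) // n  # grid rows / columns
    k_total = gh * gw

    def feature(x, y):
        r, g, b = image[y][x]
        return (r, g, b, x, y)  # 5-D sample (R, G, B, x, y)

    def update(labels, previous):
        # Update step, Eq. (9): each centroid is the mean of its samples.
        sums = [[0.0] * 5 for _ in range(k_total)]
        counts = [0] * k_total
        for y in range(h):
            for x in range(w):
                k = labels[y][x]
                counts[k] += 1
                for d, v in enumerate(feature(x, y)):
                    sums[k][d] += v
        # An emptied cluster keeps its previous centroid.
        return [[s / counts[k] for s in sums[k]] if counts[k] else previous[k]
                for k in range(k_total)]

    def neighbours(k):
        # Assumed N(i): the cluster itself plus adjacent cells of the initial grid.
        ky, kx = divmod(k, gw)
        return [yy * gw + xx
                for yy in range(max(ky - 1, 0), min(ky + 2, gh))
                for xx in range(max(kx - 1, 0), min(kx + 2, gw))]

    def dist2(f, c):
        return sum((a - b) ** 2 for a, b in zip(f, c))

    # Initial assignment: each pixel belongs to the grid square containing it.
    labels = [[(y // n) * gw + (x // n) for x in range(w)] for y in range(h)]
    centroids = init_centroids if init_centroids is not None else update(labels, None)

    for _ in range(iterations):
        # Constrained assignment, Eq. (10): only neighbouring clusters compete.
        labels = [[min(neighbours(labels[y][x]),
                       key=lambda k: dist2(feature(x, y), centroids[k]))
                   for x in range(w)] for y in range(h)]
        centroids = update(labels, centroids)
    return labels, centroids
```

On a 4x8 test image whose color edge does not coincide with the initial 4x4 grid, the cluster boundary moves onto the color edge after the first iteration, and reusing the returned centroids for the next frame reproduces the same labels in a single iteration.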
Nevertheless, the class and cost defined for Shadow regions need to be considered and used in the computations. Otherwise, false negative and false positive background-foreground classifications would appear all around the picture.

Based on these settings, the performance of the foreground segmentation algorithm on one GPU of a GTX295 card is as follows for 1376x384 and 688x192 picture resolutions. These figures imply that high-quality HD segmentation in real time is within reach by using 2 GTX295 cards if needed.

Resolution            1376x384    688x192
Comp. time / frame    ms          44.8 ms

Overall results, within the scope of the application, have proven to be very robust and consistent in terms of segmentation quality. An important feature of the algorithm and its results is the temporal consistency of segment shape and geometry. The initial over-segmentation step, the temporal consistency of the k-means centroids, and the later Belief Propagation step help maintain a consistent shape through time, avoiding region blinking, segmentation outliers and other inconsistencies. Fig. 5 and Fig. 6 depict 2 frames from two different sequences in which one can appreciate the accurate foreground segmentation achieved by the algorithm. They show the original scene, the segment, and the scene masked with the segment. In particular, holes and small details such as fingers are well preserved in the segmentation. As discussed earlier in the paper, some algorithms that assume objects to be closed regions for morphological post-processing are unable to keep such details. Fig. 7 shows the set of multi-perspective information, with both views and their respective depths, used for the 3DPresence multi-perspective 3D screens. These have been generated with the depth estimation module, which combines the segmentation algorithm presented in this work with the HRM depth estimation algorithm [12].
Both together are able to define very accurate person boundaries along with well-filled depth information within the segment, as can be seen in the extended arm detail in the picture.

Fig. 5. Segmented result from sequence 1 at 1376x384.

Fig. 6. Segmented result from sequence 2 at 1376x384.

Fig. 7. Formatted information for the Multi-Perspective 3D Screen with 2 perspective texture views, plus their respective segmented depth-maps. One can appreciate the difference in depth of the extended arm, as well as the sharp contours that correctly define the person boundary.

Finally, Fig. 8 depicts the result in a situation where the foreground contains details with a color that is similar or equal to the background. In this particular case, although the person's shirt has stripes of a color very close to that of the background, results remain consistent and good. The use of Belief Propagation over the low-noise initial segmentation result exploits the large-scale structure of actual objects (compared to the background-like color regions) to help keep segments closed and to reduce holes and segmentation outliers. This can be further appreciated in the overall video used to generate these results. Although a background-like color appears on the person's shirt, the foreground segment remains accurate, closed and, above all, very stable through time. The video sequence can be downloaded from [13].

Fig. 8. Segmented result from sequence 2 at 1376x384. Although the person's shirt has stripes of a color very close to that of the background, results remain consistent and good.

5. CONCLUSIONS

This paper has presented a robust foreground segmentation approach for real-time operation on GPU architectures. This approach is suitable for combination with real-time depth estimation algorithms for stereo-matching acceleration, flat region outlier reduction and depth boundary enhancement between regions.
The statistical models provided in this work, together with the use of over-segmented regions for statistics estimation, make the foreground segmentation more stable in space and time while remaining usable in real time on one of the 2 GPUs of a GTX295 (at 22 fps) for a resolution of around 700x200. In future work, we plan to introduce ToF cameras into the framework presented here in order to further improve resilience to excessively close (or excessively dark) shadows.
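As a quick consistency check, the 22 fps figure follows directly from the 44.8 ms per-frame time reported in Section 4 for the 688x192 resolution. The short Python snippet below reproduces the arithmetic and also computes the pixel-count ratio between the two tested resolutions. The variable names are illustrative only, and no timing is extrapolated for the larger resolution.

```python
frame_time_ms = 44.8                     # reported time per frame at 688x192
fps = 1000.0 / frame_time_ms             # frames per second, about 22.3

# Pixel-count ratio between the two tested resolutions.
pixel_ratio = (1376 * 384) / (688 * 192)

print(f"{fps:.1f} fps, pixel ratio {pixel_ratio:.0f}x")
```

The larger picture therefore carries four times as many pixels per frame as the one for which the 22 fps rate is quoted.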

6. REFERENCES

[1] O. Divorra Escoda, J. Civit, F. Zuo, H. Belt, I. Feldmann, O. Schreer, E. Yellin, W. Ijsselsteijn, R. van Eijk, D. Espinola, P. Hagendorf, W. Waizenegger, and R. Braspenning, "Towards 3D-aware telepresence: Working on technologies behind the scene," in New Frontiers in Telepresence workshop at ACM CSCW, Savannah, GA, Feb.

[2] C. L. Kleinke, "Gaze and eye contact: A research review," Psychological Bulletin, vol. 100.

[3] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, "Non-parametric model for background subtraction," in Proceedings of International Conference on Computer Vision, Sept. 1999, IEEE Computer Society.

[4] T. Horprasert, D. Harwood, and L. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," in IEEE ICCV, Kerkyra, Greece.

[5] J. L. Landabaso, M. Pardàs, and L.-Q. Xu, "Shadow removal with blob-based morphological reconstruction for error correction," in IEEE ICASSP, Philadelphia, PA, USA, March.

[6] J.-L. Landabaso, J.-C. Pujol, T. Montserrat, D. Marimon, J. Civit, and O. Divorra, "A global probabilistic framework for the foreground, background and shadow classification task," in IEEE ICIP, Cairo, November.

[7] J. Gallego Vila, "Foreground segmentation and tracking based on foreground and background modeling techniques," M.S. thesis, Image Processing Department, Technical University of Catalunya.

[8] I. Feldmann, O. Schreer, R. Schäfer, F. Zuo, H. Belt, and O. Divorra Escoda, "Immersive multi-user 3D video communication," in IBC, Amsterdam, The Netherlands, Sep.

[9] C. Lawrence Zitnick and Sing Bing Kang, "Stereo for image-based rendering using image over-segmentation," International Journal of Computer Vision.

[10] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient belief propagation for early vision," in CVPR, 2004.

[11] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., 1967, vol. 1, University of California Press.

[12] N. Atzpadin, P. Kauff, and O. Schreer, "Stereo analysis by hybrid recursive matching for real-time immersive video conferencing," vol. 14, no. 3, March.

[13] Video sequence result for the segmentation on shirt with background-like color stripes, remository& Itemid=2&func=startdown&id=124.