Silhouette completion in DIBR with the help of skeletal tracking


KASIMIDOU ELEFTHERIA

Silhouette completion in DIBR with the help of skeletal tracking

Diploma Thesis

Supervision: Prof. Pascal Frossard, Prof. Anastasios Delopoulos

Autumn 2015


ACKNOWLEDGMENTS

Working on my diploma thesis has been an exciting time which provided me with a deeper understanding of and valuable experience in the field of Image Processing and Telecommunications. First and foremost, I would like to express my gratitude to my supervisors Prof. Pascal Frossard and Prof. Anastasios Delopoulos for the opportunity to work on an interesting project and for their insightful guidance. Moreover, I would like to thank Thomas Maugey and Ana de Abreu for the discussion and advice on the theoretical part of the thesis. A very special thanks goes out to Mrs. Zoe Kantaridou for her thorough feedback on the final stage of the writing process. Last but in no way least, I am grateful to my family and friends for their unconditional support.


ABSTRACT

The present work explores the image completion problem in Multiview. More specifically, the aim is to reduce artifacts on the human figures in a scene (foreground), unlike other methods, which target the background. Skeletal tracking is used for the first time to guide the image completion process, referred to as silhouette completion, as a tool to bypass occlusions. One reference viewpoint is primarily used in the view synthesis stage. The proposed algorithms are used as a complement to general image completion methods: the latter handle the background and the former the foreground. Two approaches are examined and implemented. An initial stage, common to both approaches, exploits the skeleton to detect any discontinuities in the figure. Parts that are not missing but filled with background pixels, and would therefore not get inpainted by a conventional method, are now detected and a mask is formed there. In the first approach, silhouette completion via inpainting, an orientation term is introduced in the priority computation to favor patches with structures parallel to the skeleton. In this way, appropriate structures can propagate and complete the figure naturally. In the second approach, silhouette completion with one-plus view, the mask is formed in the same way as described above. However, it is filled in not via inpainting but with the corresponding information from a camera on the opposite side of the reference view. In this case, only the information contained in the mask is required to complete the silhouette. It can be an interesting alternative to the two-view approach in view synthesis in cases where the center of attention is on the human protagonists. Both approaches result in quality improvement. SC-inpainting yields up to a 5.4 dB increase in PSNR, a 10% increase in SSIM, and more than 70% more correctly reconstructed pixels compared to a general inpainting algorithm. SC-1plus-view manages to correctly reconstruct almost all (96%) of the pixels of the figure, with up to a 10 dB increase in PSNR and a 15% increase in SSIM compared to general inpainting. The general conclusion drawn is that integrating skeletons in the view synthesis process can in fact improve the visual quality of human figures in the virtual viewpoint.


LIST OF TABLES

Table 2.1 Depth maps in Multiview: a synthetic scene from Blender 2.69
Table 3.1 Structural image editing from Sun et al. [12]
Table 3.2 Structural image editing from Barnes et al. [11]
Table 3.3 Introducing depth images in inpainting [source: [10]]
Table 3.4 Introducing structure tensors in the data term of the priority
Table 3.5 The data term of Buyssens, Daisy et al. [source: [14]]
Table 4.1 Skeletal tracking can bypass occlusions
Table 4.2 Selection of the optimal skeleton from all available views
Table 4.3 The joints of the skeleton numbered
Table 4.4 Projection to the virtual viewpoint
Table 4.5 The missing region is either manually specified or indicated by the missing pixels themselves
Table 4.6 Examples of two projections
Table 4.7 Comparison of existing methods: wide baseline transition
Table 4.8 Comparison of existing methods: small baseline transition
Table 5.1 Detection of exposed parts of the skeleton
Table 5.2 Mask formation step by step
Table 5.3 Filling the entire image
Table 5.4 Inpainting the produced mask with SC-inpainting and general inpainting
Table 5.5 Inclusion of the term of (3.8) in the priority computation
Table 5.6 Depth estimation: step-by-step example with one bone
Table 5.7 Depth estimation for the entire figure
Table 6.1 The paradox of PSNR
Table 6.2 Removing small breaking lines by averaging
Table 6.3 Mask formation step by step
Table 6.4 SC-inpainting before and after general inpainting
Table 6.5 Comparison: experiment cam4 to cam1 in frame
Table 6.6 The masks used by SC-inpainting and by general inpainting
Table 6.7 Comparison of original and alternative SC-inpainting: experiment cam4 to cam1 in frame 0
Table 6.8 Depth estimation for experiment cam4 to cam1 in frame
Table 6.9 Mask formation step by step: experiment cam0 to cam3 in frame
Table 6.10 Comparison: experiment cam0 to cam3 in frame
Table 6.11 Looking only on the mask
Table 6.12 Comparison of original and alternative SC-inpainting: experiment cam0 to cam3 in frame
Table 6.13 Depth estimation for experiment cam0 to cam3 in frame
Table 6.14 Robustness to skeleton errors: experiment cam4 to cam1 in frame
Table 6.15 Adding noise of σ=15 and σ=20 on the skeleton
Table 6.16 Silhouette completion: experiment cam4 to cam1 in frame
Table 6.17 Silhouette completion: experiment cam0 to cam3 in frame

LIST OF IMAGES

Image 2.1 The depth map format
Image 2.2 Three coordinate systems
Image 3.1 Isophotes (in red) for one out of ten pixels [sources: [8] and [23]]
Image 3.2 The filling process [source: [9]]
Image 3.3 Notation diagram [9]
Image 3.4 Texture synthesis with Criminisi et al. [source: [9]]
Image 3.5 Exploiting camera movement [source: [15]]
Image 4.1 A tracking example with the Kinect v
Image 4.2 Up to six people can be tracked with the skeletal tracking algorithm of Kinect
Image 4.3 The problem of the undetected disocclusions on the figure
Image 4.4 Notation diagram [source: [9]]
Image 4.5 Smaller baseline transition: virtual view from cam5 to cam
Image 5.1 The produced mask
Image 5.2 Super-Gaussian functions
Image 6.1 Projection from the auxiliary viewpoint onto the desired virtual viewpoint
Image 6.2 The missing region as seen from the auxiliary viewpoint

LIST OF ABBREVIATIONS

BG: background
FG: foreground
SC: silhouette completion
SC-inpainting: silhouette completion via inpainting
SC-1plus-view: silhouette completion with 1plus view

CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF IMAGES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 THE PROBLEM
1.2 AIM
1.3 THESIS ORGANIZATION
2 MULTIVIEW
2.1 STEREOSCOPY
2.2 DEPTH-IMAGE-BASED RENDERING (DIBR)
2.3 DEPTH MAPS
2.4 COORDINATE SYSTEMS AND CAMERA PARAMETERS
2.5 IMAGE WARPING
3 INPAINTING REVIEW
3.1 INTRODUCTION
3.2 SOME MATHEMATICAL TOOLS [8]
    Isophotes
    Structure Tensors
3.3 OVERVIEW OF STATE-OF-THE-ART
    Introducing depth maps to inpainting
    Camera movement
    Structure Tensors in the data term
    A more robust data term
    Additional proposals
PROBLEMS REMAINING IN THE LITERATURE
4 PROBLEM FORMULATION
    SKELETAL TRACKING
        Advantages
        Optimal skeleton
        Using skeletons in practice
    DISOCCLUSION DETECTION AND MASK FORMATION
    FILLING IN BY SC-INPAINTING
        The inpainting problem
        Inpainting step by step
        Desired outcome
    FILLING-IN WITH 1PLUS VIEW
    GENERAL OVERVIEW
5 PROPOSED SOLUTION
5.1 DETECTION OF DISOCCLUSIONS
    CREATION OF THE MASK
    SILHOUETTE COMPLETION VIA INPAINTING
        Priority computation and orientation term
        Filling the entire image
        An experiment
        An alternative (depth estimation)
    SILHOUETTE COMPLETION 1PLUS-VIEW
6 EXPERIMENTAL RESULTS
    EVALUATION METRICS
        PSNR as an evaluation metric
        Structure Similarity Index Measure
        C metric
    SILHOUETTE COMPLETION VIA INPAINTING
        Experiment cam4 to cam1 in frame
        Experiment cam0 to cam3 in frame
        Robustness against skeleton errors
    SILHOUETTE COMPLETION WITH ONE-PLUS VIEW
        Experiment cam4 to cam1 in frame
        Experiment cam0 to cam3 in frame
    OVERALL DISCUSSION
7 CONCLUSION
    CONCLUSIONS OF THE THESIS
    PERSPECTIVES FOR FUTURE WORK
REFERENCES

1 INTRODUCTION

1.1 THE PROBLEM

In novel multimedia applications, users are enabled to interactively navigate through different viewpoints of a scene (the look-around effect) with the help of three-dimensional information, instead of just being passive viewers [1] [2]. The development of depth sensors has made the acquisition of the 3D information of a scene possible. A set of predefined viewpoints, frequently referred to as reference viewpoints, are actually captured, while all other viewpoints on the navigation path, called virtual viewpoints, are reconstructed at the receiver. The process of rendering or synthesizing those virtual viewpoints is called view synthesis [3]. Conventional acquisition systems switch abruptly between the viewpoints captured by the cameras, while view synthesis can provide more comfort to viewers by creating a more graceful transition [4].

There exist two approaches to view synthesis: model-based rendering and image-based rendering. The former explicitly reconstructs a 3D model of the observed scene, whereas the latter creates the virtual view directly in the image space. An image-based rendering technique that uses depth information to synthesize views is Depth-Image-Based Rendering (DIBR). DIBR is essentially the projection of pixels of the reference viewpoint onto a virtual image plane by utilizing the depth information of the reference viewpoint [5] [3].

An inherent problem in DIBR, and in view synthesis in general, is that information which is occluded (blocked, not visible) in the reference viewpoint may become visible in the virtual viewpoint. Those newly disoccluded regions are named disocclusion holes in the literature [2] [4]. A way in which these newly disoccluded regions can be covered, without requiring any information beyond what a single reference viewpoint can provide, is through an inpainting process [6]. Inpainting denotes the process of restoring or filling in missing or damaged data in an image [7] [8]. In inpainting, the existing information of an image is used to complete its missing parts in a plausible way [9]. The research field of inpainting has been very active in recent years thanks to its numerous applications, which include image restoration, text removal, object removal and, of course, disocclusion handling in DIBR [8].

The main focus of research in disocclusion handling in the context of DIBR has been the completion of the so-called holes in the image background (BG). The background is where the largest disocclusion holes occur, due to the displacement of foreground objects in the virtual view [10]. The foreground (FG), however, can also suffer from such artifacts, especially (but not only) in sparsely camera-covered scenes, where the displacement between reference and virtual viewpoint can be quite large. Foreground objects can either be

partly occluded by themselves (self-occlusion) or by other FG objects. Moreover, the FG is usually the most interesting element in a scene. Thus, in the present thesis the aim is to eliminate such artifacts on the FG, and in particular artifacts on human figures present in the observed scene. People are a common subject in a variety of cases, from sporting events and concerts to security cameras.

1.2 AIM

The aim of this thesis is to resolve the problem of disocclusions on human silhouettes in the context of Multiview. In other words, we want to complete the silhouettes, which is why we will be referring to this problem as silhouette completion (SC). Two key points triggered the idea of this project: first, the need to bypass occlusions and, second, the need to improve the visual quality of the human silhouettes specifically. That is how the idea of integrating the skeletal information came up: the knowledge of the pose of the person could be used as a guide for an image completion algorithm. To our knowledge, this is the first time that skeletal tracking is integrated in the inpainting process. In the specific problem of human figure completion we would like to see the silhouette continued towards a certain direction; in other words, we want a structure propagation algorithm. Image editing algorithms that favor the continuation of certain lines and edges do exist in the literature [11] [12]. However, none of them is completely automatic; they always require human assistance to specify this line. In this project an innovative structure propagation algorithm for human silhouettes is introduced, and it is fully automatic.

Two approaches to silhouette completion are examined: the first uses exclusively the information the reference camera can provide (color and depth image along with the skeleton vector) and is based on inpainting, while the second selectively takes some more information from another camera in addition to the reference camera (which remains the primary source of information).

In the first approach (inpainting-based), every bone of the skeleton is handled separately. For each bone we first restrict our working area to a window around the bone and then check for discontinuities in the silhouette. If any are detected, a mask is formed, on which inpainting will be performed. In the inpainting stage, the direction of the bone is used to prioritize patches with structures of similar direction, so that they are inpainted first. A modification of the priority computation step guides the inpainting process to propagate structures towards the right direction. As an alternative, a depth estimation stage is inserted between the mask formation stage and the inpainting stage, again guided by the skeleton; we will subsequently test the effectiveness of this addition. Since this approach is based on inpainting, it will be referred to as silhouette completion via inpainting, or SC-inpainting for short.

In the second approach, the mask formed as described above is filled in not via inpainting but with the corresponding information from another camera. A selection process is introduced to find the camera which contains the missing information, and the skeleton assists once more in locating the corresponding points. Since this approach uses selected parts of the image from an additional camera apart from the reference camera, it is named silhouette completion with 1plus view, or SC-1plus-view. This name is a compromise between the one-view and the two-view approach in Multiview, which will be elaborated in chapter 2.

The proposed methods (SC-inpainting and SC-1plus-view) do not aim to replace existing general inpainting algorithms; these would still be required to handle artifacts in the background, if that is desired. The silhouette completion methods of the present thesis aim to complement those algorithms by specializing particularly in the human figures of an observed scene. The code required for the purposes of this thesis was developed by the author of this document, except where otherwise stated.

The main contribution of this thesis is that it examines for the first time the introduction of skeletal data in the process of view synthesis. Numerous ways to exploit the information of this data are examined and implemented. Briefly, skeletons are used (a) for the detection of the self-occlusions or any disoccluded parts of the silhouette, (b) for the creation of an appropriate mask in those regions and (c) either as a guide for an inpainting process or as an indicator of the required parts from the second camera view. Last, the demand of existing inpainting methods for dense camera coverage is relaxed here, since we have a restricted area of interest.

1.3 THESIS ORGANIZATION

Chapter 2 is an introduction to Multiview and DIBR. An overview of the state-of-the-art inpainting methods is provided in chapter 3, while in chapter 4 the problem is formulated. In chapter 5 the two proposed methods of silhouette completion are explained, and experimental results are presented in chapter 6. Lastly, the conclusions of this thesis are discussed in chapter 7.


2 MULTIVIEW

2.1 STEREOSCOPY

Stereoscopy is a technique to create the illusion of three-dimensional depth in an image by means of binocular vision (stereopsis). In Greek, στερεός means solid and σκοπέω means to look, to observe, and the combination of the two creates the word στερεοσκοπία [13]. The human brain perceives the distance to an object thanks to the slightly different viewpoints of the left and the right eye. Therefore, in stereoscopy we aim to present a slightly different image to each eye, which will then be combined in the brain to give the perception of depth. [3]

2.2 DEPTH-IMAGE-BASED RENDERING (DIBR)

The development of depth sensors has inspired a great variety of applications in the field of computer vision, such as three-dimensional video (3DV or 3DTV) and free-viewpoint video (FVV, or free-viewpoint television, FTV). They allow the user to interactively control the viewpoint and generate new views of a dynamic scene from any 3D position. The interesting thing is that the focus of attention can be controlled by the viewers rather than a director, meaning that each viewer may be observing a unique viewpoint. This is the reason why three-dimensional video is considered to be the next milestone in video technology after high-definition video (HD). [3]

Stereoscopic or 3D video coding is required to display stereoscopic content. Multiview video coding (MVC) is a 3D video coding standard for video compression that efficiently encodes video sequences captured simultaneously from multiple camera angles. In a Multiview-video-plus-depth system, color images along with their corresponding depth images are utilized. Depth images show the distance from the camera to each point in a scene, also known as depth, and they are captured by depth sensors.

The basic principle of a depth sensor is that, given two cameras a known distance l apart, we can calculate the distance d to an object by measuring the viewing angles a and b at the two cameras (triangulation):

l = d/tan(a) + d/tan(b) = d · sin(a + b) / (sin(a) · sin(b))  ⇒  d = l · sin(a) · sin(b) / sin(a + b) ( 2.1)

Table 2.1 Depth maps in Multiview: a synthetic scene from Blender 2.69

Only a certain number of views of a scene are actually captured and encoded. Every other possible viewpoint can be reconstructed from the existing original ones with a view synthesis technique.
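The triangulation relation (2.1) is easy to check numerically. The sketch below is not part of the thesis code; the function name is hypothetical, and it assumes the angles a and b are measured from the baseline between the two cameras:

```python
import math

def triangulate_depth(l, a, b):
    """Depth d of a point observed by two cameras a baseline l apart,
    with viewing angles a and b (radians) measured from the baseline.
    Rearranged from l = d/tan(a) + d/tan(b), as in (2.1)."""
    return l * math.sin(a) * math.sin(b) / math.sin(a + b)

# Symmetric case: both angles 45 degrees, so the point lies at depth l/2
d = triangulate_depth(1.0, math.pi / 4, math.pi / 4)  # -> 0.5
```

As expected, shrinking the angles (a more distant point) drives the recovered depth up for the same baseline, which is why small baselines give noisy depth for far objects.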

There are two categories of view synthesis techniques: model-based rendering and image-based rendering [4]. The first category aims at reconstructing an explicit 3D model of the filmed scene from the multiview images. The quality of each synthetic view is directly dependent on the accuracy of this estimated 3D model, and high accuracy in turn relies on a very dense scene coverage, which is not always possible or easy [4]. Image-based rendering, on the other hand, does not require the construction of an intermediate 3D model of the scene; it creates the virtual view directly in the image color space [4]. When the depth information is further exploited to synthesize views (video-plus-depth), the rendering is called Depth-Image-Based Rendering (DIBR). DIBR enables us to transmit a sparse number of views (color along with the corresponding depth) and efficiently synthesize all intermediate views.

2.3 DEPTH MAPS

A depth map is a gray-scale image that represents the depth level at each pixel. A common depth format is the following (and the one used in our own experiments as well):

1/Z = (Y/255) · (1/Z_near − 1/Z_far) + 1/Z_far ( 2.2)

where Y is the depth value at each pixel, in the range 0 to 255 (if the depth is encoded with a bit depth other than 8, the normalization factor changes accordingly). Z_near and Z_far are the minimum and maximum depth values in the scene, while Z is commonly used to denote the depth. So in a depth map formatted in such a way, we see the farthest 3D point as the darkest (value 0) and the closest point as the brightest (value 255, or the equivalent maximum according to the range).

Image 2.1 The depth map format: 255 corresponds to the nearest point, 0 to the farthest point (source: wikipedia)
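The format of (2.2) is straightforward to invert in code. The helper below is a sketch by way of illustration (the function name is hypothetical), assuming the 8-bit encoding described above:

```python
import numpy as np

def depth_from_gray(Y, z_near, z_far, bits=8):
    """Invert the depth map format of (2.2):
    1/Z = (Y / (2^bits - 1)) * (1/z_near - 1/z_far) + 1/z_far."""
    y = np.asarray(Y, dtype=np.float64) / (2 ** bits - 1)
    inv_z = y * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far
    return 1.0 / inv_z

# The brightest value (255) maps to z_near, the darkest (0) to z_far
Z = depth_from_gray(np.array([0, 255]), z_near=1.0, z_far=10.0)  # -> [10., 1.]
```

Note that the encoding quantizes inverse depth, so the representation is finer near the camera and coarser far away.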

2.4 COORDINATE SYSTEMS AND CAMERA PARAMETERS

Image 2.2 Three coordinate systems

Three coordinate systems are defined: world coordinates, camera coordinates and image coordinates (Image 2.2). The first system can have its origin at any point in the scene and is independent of any particular camera. For the second one, the x-y plane is identical to the camera plane (also known as the principal plane), which is located at the optical center of the camera. The last system is a two-dimensional system (unlike the world and camera coordinate systems, which are three-dimensional) and corresponds to the image plane. The intersection point of the optical axis and the image plane is called the principal point.

The so-called intrinsic matrix A represents the transformation from camera to image coordinates:

z_c · (u, v, 1)^T = A · (x_c, y_c, z_c)^T  with  A = [ f_x 0 o_x ; 0 f_y o_y ; 0 0 1 ] ( 2.3)

where (f_x, f_y) is the focal length of the camera and (o_x, o_y) is the principal point offset. The latter represents the displacement of the origin of the image coordinates from the principal point (so if they coincide it is (0, 0)).

The extrinsic matrix E = [R t], which consists of a rotation matrix R and a translation vector t, represents the transformation from world coordinates to camera coordinates:

(x_c, y_c, z_c)^T = R · (x_w, y_w, z_w)^T + t ( 2.4)
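Equations (2.3) and (2.4) chain naturally into a world-to-pixel projection. A minimal numpy sketch follows; the function name and camera values are hypothetical, chosen only to illustrate the two transformations:

```python
import numpy as np

def world_to_image(p_w, A, R, t):
    """Project a 3D world point to pixel coordinates:
    p_c = R p_w + t (eq. 2.4), then z_c [u, v, 1]^T = A p_c (eq. 2.3)."""
    p_c = R @ p_w + t                # world -> camera coordinates
    uv1 = A @ p_c                    # camera -> homogeneous image coordinates
    return uv1[:2] / uv1[2], p_c[2]  # (u, v) pixel coordinates and depth z_c

# Hypothetical camera: f_x = f_y = 500 px, principal point offset (320, 240)
A = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
(u, v), z_c = world_to_image(np.array([0.0, 0.0, 2.0]), A, R, t)
# A point on the optical axis projects to the principal point: (320, 240)
```

The division by the homogeneous coordinate uv1[2] is exactly the z_c factor on the left-hand side of (2.3).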

The intrinsic matrix A and the extrinsic matrix E make up the camera parameters. It is worth noting that they are both non-singular matrices (det ≠ 0).

2.5 IMAGE WARPING

The view synthesis problem is formulated as follows: given a color image and its corresponding depth map from a certain viewpoint, hereinafter called the reference view, we wish to reconstruct the image (and possibly the depth map) in the virtual view. Forward warping is basically the mapping of each pixel from the reference view to the virtual view with the help of the reference depth information; this is why DIBR is also called 3D image warping in the literature. The basic idea is to first go from the image coordinates of the reference view (u_r, v_r) to the world coordinates (x_w, y_w, z_w), which are common to all viewpoints, and from there obtain the image coordinates in the virtual view (u_v, v_v):

(u_r, v_r) → (x_w, y_w, z_w) → (u_v, v_v) ( 2.5)

1st step: (x_w, y_w, z_w)^T = R_r^(−1) · (z_c,r · A_r^(−1) · (u_r, v_r, 1)^T − t_r) ( 2.6)

2nd step: z_c,v · (u_v, v_v, 1)^T = A_v · (R_v · (x_w, y_w, z_w)^T + t_v) ( 2.7)

In the simplified case where the cameras are in a 1D parallel arrangement (meaning that we only move along the u-axis), and assuming identical cameras in all views (so the same focal length f and the same rotation matrix), we have:

u_v = u_r + f · (t_x,v − t_x,r)/z + (o_x,v − o_x,r)  and  v_v = v_r ( 2.8)

where t_x,v and t_x,r are the horizontal components of the translation vector t of the virtual and the reference view respectively, and o_x,v and o_x,r are the horizontal components of the principal point offset in these views. The term d = f · (t_x,v − t_x,r)/z + (o_x,v − o_x,r) = f · l/z + du is called the disparity of the projection, a term which is sometimes also used to describe the displacement of the pixels between the views in more general cases.
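The two-step warp of (2.6) and (2.7) translates directly into code. The sketch below is illustrative only (hypothetical function; a real DIBR implementation would warp whole images, round to the pixel grid, and resolve collisions by depth ordering):

```python
import numpy as np

def forward_warp_pixel(u_r, v_r, z, A_r, R_r, t_r, A_v, R_v, t_v):
    """Warp one reference pixel (u_r, v_r) with depth z into the virtual view:
    image -> world (eq. 2.6), then world -> virtual image (eq. 2.7)."""
    # Back-project to world coordinates
    p_w = np.linalg.inv(R_r) @ (z * np.linalg.inv(A_r) @ np.array([u_r, v_r, 1.0]) - t_r)
    # Re-project into the virtual camera
    uv1 = A_v @ (R_v @ p_w + t_v)
    return uv1[:2] / uv1[2]

# 1D parallel setup: identical cameras, virtual camera shifted by t_x = 0.1
A = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = forward_warp_pixel(320.0, 240.0, 2.0,
                        A, np.eye(3), np.zeros(3),
                        A, np.eye(3), np.array([0.1, 0.0, 0.0]))
# Matches the disparity of (2.8): u_v = 320 + 500 * 0.1 / 2 = 345, v_v = 240
```

In this 1D parallel configuration the full warp collapses to the horizontal shift of (2.8), which is a useful consistency check for an implementation.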

Backward warping is implemented in VSRS (the view synthesis reference software) to make view synthesis more robust to depth errors. Its first step is a classical forward warping as described above, but instead of projecting the color image, the depth image is projected into the virtual view. This step is followed by blending and hole-filling of the projected depth map, at which point we have a complete depth map without gaps. Then the corresponding color pixels are mapped into the virtual view by using the disparities provided by the depth map projection.

3 INPAINTING REVIEW

3.1 INTRODUCTION

Image inpainting, also known as image completion or filling-in, refers to filling in the missing or damaged parts of an image or a video with the most appropriate data, in order to obtain harmonious and hardly detectable reconstructed regions. This is a very important topic in image processing, with applications including image and video coding (e.g. Multiview Video Coding, MVC, and Image-Based Rendering), wireless image transmission (e.g. recovering lost blocks), special effects (e.g. removal of objects or persons), and image restoration from scratches or text overlays. [14] [8]

Image inpainting is an ill-posed inverse problem that has no well-defined unique solution. To solve the problem, it is therefore necessary to introduce image priors. Inpainting methods are guided by the assumption that pixels in the known and unknown regions of the image share the same statistical properties or geometrical structures. This assumption translates into different local or global priors, with the goal of having an inpainted image that is as visually natural as possible. [8] There exist many types of inpainting methods, the main ones of which are described in this section. Hereinafter, the known part of the image is denoted as Φ and the missing part, also referred to as the mask or hole, as Ω, as is established in the literature. [9] [10] [15]

The first category of methods, known as diffusion-based inpainting, introduces smoothness priors via parametric models or partial differential equations (PDEs) to propagate (or diffuse) local structures from the exterior to the interior of the hole. Many variations of these methods exist, using different models (linear, nonlinear, isotropic, or anisotropic) to favor propagation in particular directions or to take into account the curvature of the structure present in a local neighborhood. These methods are naturally well suited for completing straight lines and curves, and for inpainting small regions. However, they are not well suited for recovering the texture of large areas, which they tend to blur. [8]

The second category of methods, known as texture synthesis, exploits image statistical and self-similarity priors. These methods rely on the assumption that the image contains random stationary textures and homogeneous regular patterns [16]. The texture to be synthesized is learned from the known parts of the image (Φ). Learning is done by sampling, and by copying or stitching together patches (called exemplars) taken from the known part of the image. The corresponding methods are known as exemplar-based techniques. [8] They use the information provided by geometry (PDEs, structure tensors etc.) to find the optimal patch (exemplar) from the rest of the image. In other words, they combine exemplars with geometric approaches, which is quite effective. In fact, exemplar-based methods prevail in the literature of inpainting methods in multiview scenarios.

Sparsity priors have also been considered in inpainting (sparsity-based methods). The image is assumed to admit a sparse representation on a given basis [17]. A variety of bases has been tested in the literature, such as DCT, wavelets, framelets, curvelets, etc. [18] [19] [20] [21]. Depending on the actual content of the image (edges, textures, etc.) the choice of the dictionary can be crucial. These methods can show impressive results when the missing part is small, thin or sparsely distributed over the image (e.g. noise) and there is natural redundancy present. However, they fail in reconstructing large disocclusion areas like the ones present in Multiview Imaging. [8] [14]

Images and videos are not the only kind of multimedia where inpainting finds use. 3D surfaces obtained from range scanners frequently have missing parts, mainly due to occlusions or even lack of sufficient coverage of the object by the scanner, so the resulting 3D model is incomplete. [22]

Lastly, it should be noted that there are automatic and semi-automatic inpainting methods. In semi-automatic methods, human assistance is required to either define the region to be filled in (object removal methods like [9]) and/or to specify lines which have to be preserved during the inpainting (structural image editing methods like [12] and PatchMatch [11]).

Original image / After removing the pumpkin / Intermediate result / Final result
Table 3.1 Structural image editing from Sun et al. [12]

Original image / Hole and constraints / Final result
Table 3.2 Structural image editing from Barnes et al. [11]

As can be observed, semi-automatic methods are very efficient and can propagate structures in a very convincing manner. However, they are unsuitable for real-time processes.

3.2 SOME MATHEMATICAL TOOLS [8]

ISOPHOTES

Isophotes are isolines of the image intensity function, meaning that along an isophote the intensity of the image (or image patch) is constant. Their direction at a given pixel p is given by the normal of the gradient vector computed at p. We denote the gradient of the image I at a point p as ∇I_p (the gradient of a patch of I centered on p is denoted in the same way):

∇I_p = (∂I_p/∂x) x̂ + (∂I_p/∂y) ŷ = (∂I_p/∂x, ∂I_p/∂y)^T ( 3.1)

where ∂I_p/∂x is the gradient of I_p in the x direction and ∂I_p/∂y is the gradient in the y direction. The isophote direction is then expressed as ∇I_p^⊥, the gradient rotated by 90 degrees:

∇I_p^⊥ = −(∂I_p/∂y) x̂ + (∂I_p/∂x) ŷ ( 3.2)

Image 3.1 Isophotes (in red) for one out of ten pixels. [sources: [8] and [23]]
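The rotation in (3.2) is a one-liner in numpy. The helper below is a hypothetical sketch (note that np.gradient returns the derivative along the row axis first, then the column axis):

```python
import numpy as np

def isophote_direction(I, y, x):
    """Isophote direction at pixel (y, x): the gradient (dI/dx, dI/dy)
    rotated by 90 degrees into (-dI/dy, dI/dx), as in (3.2)."""
    gy, gx = np.gradient(I.astype(np.float64))  # derivatives along y, then x
    return np.array([-gy[y, x], gx[y, x]])

# An intensity ramp along x has a horizontal gradient, so vertical isophotes
I = np.tile(np.arange(5.0), (5, 1))
d = isophote_direction(I, 2, 2)  # orthogonal to the gradient direction (1, 0)
```

By construction the returned vector is always orthogonal to the gradient, i.e. tangent to the level line of constant intensity.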

STRUCTURE TENSORS

Another way to mathematically describe the local geometry is the structure tensor. Given a one-channel image I, the structure tensor computed for each pixel p is:

G = ∇I_p · ∇I_p^T = [ (∂I_p/∂x)²  (∂I_p/∂x)(∂I_p/∂y) ; (∂I_p/∂x)(∂I_p/∂y)  (∂I_p/∂y)² ] ( 3.3)

where ∇I_p is the image gradient at p and ∇I_p^T is its transpose. In the case of an RGB image (3 channels), the structure tensor is computed for each channel separately and the results are added:

G = Σ_{i=1}^{3} (∇I_p)_i · (∇I_p^T)_i ( 3.4)

this again being the structure tensor at point p.

3.3 OVERVIEW OF STATE-OF-THE-ART

Criminisi et al. [9] is a seminal work in exemplar-based inpainting, which has inspired a lot of research, including the present project. The core of the algorithm is an isophote-driven image sampling process based on two key observations. The first is that if the patch to be inpainted lies on the continuation of an image edge, then the most likely matches will lie along the same edge. The second is that the filling order is crucial: instead of filling from the outside inwards in concentric layers, we should start from the contours of sharp edges and propagate them, so priority should be given to regions that lie on the continuation of image structures. This is the big advantage of the method of Criminisi over previous onion-peel approaches.

Onion peel / Criminisi et al.
Image 3.2 The filling process [source: [9]]
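Returning to the structure tensor of (3.3) and (3.4), it can be computed densely for a multi-channel image with a few lines of numpy. The helper below is a hypothetical sketch, summing the per-channel outer products of the gradient:

```python
import numpy as np

def structure_tensor(I):
    """Per-pixel structure tensor G = grad(I) grad(I)^T (eq. 3.3), summed
    over channels for multi-channel images (eq. 3.4). Returns (H, W, 2, 2)."""
    I = np.atleast_3d(np.asarray(I, dtype=np.float64))
    G = np.zeros(I.shape[:2] + (2, 2))
    for c in range(I.shape[2]):
        gy, gx = np.gradient(I[..., c])  # derivatives along y, then x
        G[..., 0, 0] += gx * gx
        G[..., 0, 1] += gx * gy
        G[..., 1, 0] += gx * gy
        G[..., 1, 1] += gy * gy
    return G

# On an intensity ramp along x the tensor reduces to [[1, 0], [0, 0]]
G = structure_tensor(np.tile(np.arange(4.0), (4, 1)))
```

The eigenvalues and eigenvectors of G at each pixel give the strength and direction of the dominant local structure, which is what tensor-based data terms later exploit.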

The region-filling algorithm is as follows.

Image 3.3 Notation diagram [9]

Ω: missing region
Φ: source region
δΩ: contour of Ω, or fill front
Ψ_p: patch of the image centered at p
n_p: unit vector perpendicular to δΩ at the point p
∇I_p^⊥ = (−∂I_p/∂y, ∂I_p/∂x): the direction of the isophote at point p

The patches Ψ_p centered on δΩ contain parts that exist (from Φ) and parts that are missing (from Ω). For each patch Ψ_p on δΩ we want to find the patch in Φ that best matches its existing texture, and then replace. The size of the patches should be specified each time to be slightly larger than the largest distinguishable texture element in the source region Φ. Each pixel is given a confidence value C(p), which at the beginning of the algorithm is 1 for pixels in Φ and 0 for pixels in Ω:

C(p) = 1 if p ∈ Φ, 0 if p ∈ Ω ( 3.5)

The algorithm consists of three steps:
1. computing the priorities of the patches on the fill front and selecting the patch to be filled in,
2. finding the best match and propagating the information, and
3. updating the confidence values and continuing with the next patch.

Priority computation favors patches which lie on the continuation of strong edges and which contain a lot of reliable information (high-confidence pixels). It is therefore the product of two terms, the confidence term and the data term: P(p) = C(p) · D(p) ( 3.6) with confidence term C(p) = (1/|Ψ_p|) Σ_{q ∈ Ψ_p ∩ (I−Ω)} C(q) ( 3.7) and data term D(p) = |∇I_p^⊥ · n_p| / α ( 3.8) where |Ψ_p| is the area of Ψ_p in pixels, α is a normalization factor (e.g. α = 255 for a typical 8-bit grey-level image) and n_p is a unit vector orthogonal to the front δΩ at the point p. When all priorities have been computed, the patch Ψ_p̂ with the highest priority is selected to be filled. In order to find its best match we compute a distance between Ψ_p̂ and all patches in Φ. In most studies in the literature (as well as the present one) the distance is simply the sum of squared differences over the already-filled pixels of the two patches. Once the most similar patch is found, we copy from it: Ψ_q̂ = argmin_{Ψ_q ∈ Φ} d(Ψ_p̂, Ψ_q) ( 3.9) where d(a, b) = Σ_{i∈[1,k]} Σ_{j∈[1,k]} (a(i,j) − b(i,j))², assuming the patches are k×k. All the missing pixels of Ψ_p̂ are then replaced by their corresponding pixels in Ψ_q̂, and finally these pixels receive a new confidence value: C(q) = C(p̂) ∀q ∈ Ψ_p̂ ∩ Ω ( 3.10) Here is an overview of the algorithm in pseudocode: Extract the manually selected initial fill front δΩ⁰ Repeat: 1. Identify the fill front δΩ^t. 2. If Ω^t = ∅ then exit. 3. Compute the priorities P(p), ∀p ∈ δΩ^t.

4. Find the patch Ψ_p̂ with maximum priority. 5. Find the exemplar Ψ_q̂ that minimizes d(Ψ_p̂, Ψ_q̂). 6. Copy the image data from Ψ_q̂ to Ψ_p̂, ∀p ∈ Ψ_p̂ ∩ Ω. 7. Update C(p), ∀p ∈ Ψ_p̂ ∩ Ω. Image 3.4 Texture synthesis with Criminisi et al. [source: [9]] This is a general inpainting algorithm. It was originally designed to solve the problem of object removal in images, not specifically for DIBR. It has, however, inspired works that address the inpainting problem in DIBR, such as [10] and [15]. INTRODUCING DEPTH MAPS TO INPAINTING In Image-Based Rendering we have the additional information of depth. Points with similar depth are more likely to be part of the same object, or in any case to have more in common. We can therefore use depth along with the RGB components to compute the similarity of patches in the patch-matching stage of the algorithm. Moreover, a very insightful observation of Daribo and Pesquet-Popescu [10] is that, since disocclusions are caused by the displacement of FG objects revealing some BG areas, it is more sensible to fill the holes with BG pixels. So, as a first step, the depth map is projected to the virtual viewpoint along with the texture image, and a simple inpainting algorithm is performed on it to fill its holes. As will be seen below, the depth information of the disoccluded regions is needed. A simple isotropic diffusion such as the one of Bertalmio et al. [7] can be used to inpaint the depth holes. The depth patch corresponding to a (color) patch Ψ_p is denoted Z_p, and a similar correspondence holds for Z_p̂ and Z_q̂. An extra term, which takes the depth information into account, is added to the priority computation in [10]: P(p) = C(p) · D(p) · L(p) ( 3.11)

L(p) is called the level regularity term and is given by the inverse variance of the depth patch Z_p: L(p) = |Z_p| / ( |Z_p| + Σ_{q ∈ Z_p ∩ Φ} (Z_p(q) − Z̄_p)² ) ( 3.12) where |Z_p| is the area of the depth patch in pixels (the same, of course, as |Ψ_p|), Z_p(q) is the depth of pixel q in the patch Z_p and Z̄_p is the mean depth value in the patch. Furthermore, the patch matching becomes: Ψ_q̂ = argmin_{Ψ_q ∈ Φ} { d(Ψ_p̂, Ψ_q) + β d(Z_p̂, Z_q) } ( 3.13) where β controls the importance given to the depth information. Alternatively, the best matching candidate is: Ψ_q̂ = argmin_{Ψ_q ∈ Φ} d(Ψ_p̂, Ψ_q) ( 3.14) with d(a, b) = Σ_{k=R,G,B,Z} α_k ||a_k − b_k||² and α_k = 1 for k = R, G, B and α_k = 3 for k = Z, so that the depth channel is as important as the three color channels together. It is noted that in this way foreground patches are not prevented from being chosen, but they are seriously penalized as candidates. Experimental results showed that the performance and the visual quality improved compared to Criminisi's algorithm. Table 3.3 Introducing depth images in inpainting: disocclusion regions to fill in, Criminisi's inpainting, Daribo's inpainting [source: [10]]
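The two depth-aware ingredients above, the level regularity term of Eq. 3.12 and the depth-augmented patch distance of Eq. 3.13, are short enough to sketch directly. This is a NumPy sketch for illustration, not the thesis's own implementation; the function names are ours:

```python
import numpy as np

def level_regularity(Zp):
    """Level regularity term L(p) of Eq. 3.12: inverse-variance weighting of
    the depth patch Zp, so that flat (background-like) depth patches receive
    a higher priority."""
    Zp = np.asarray(Zp, dtype=float).ravel()
    area = Zp.size
    return area / (area + ((Zp - Zp.mean()) ** 2).sum())

def match_cost(color_p, color_q, depth_p, depth_q, beta=1.0):
    """Patch distance of Eq. 3.13: SSD in color plus a beta-weighted SSD in
    depth, to be minimised over candidate patches in the source region."""
    ssd = lambda a, b: ((np.asarray(a, float) - np.asarray(b, float)) ** 2).sum()
    return ssd(color_p, color_q) + beta * ssd(depth_p, depth_q)
```

A perfectly flat depth patch gives L(p) = 1 (maximal), while a patch straddling a depth discontinuity gives a value close to 0, which is exactly how foreground-contaminated patches get penalized.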

CAMERA MOVEMENT As observed in [10], the holes should indeed be filled with information from the BG rather than the FG. However, in the context of multiview imaging, further constraints can be added to improve the visual result and the performance of the inpainting. Specifically, assuming a towards-right camera movement, the disoccluded regions will appear on the right of the FG object that was previously occluding them; the analogous holds for a towards-left camera movement. The objective is to prevent structure propagation from all other sides (top, bottom, left, and additionally from FG regions) and to propagate structure only from the right side of the hole. Thus, the patch priorities are computed only along the right border of the hole and set to zero in all other directions. This means that the left border of the mask is filled in at the very end of the process. [15] Image 3.5 Exploiting camera movement [source: [15]] STRUCTURE TENSORS IN THE DATA TERM The data term can be reinforced by using a structure tensor instead of a gradient. The depth information provided in view synthesis can be included in this tensor as well. [15] [24] As mentioned above, the structure tensor of an RGB image is: G = Σ_{i=R,G,B} (∇I_p)_i (∇I_p)_i^T ( 3.15) Now the depth information Z is also used to compute the tensor:

G = Σ_{i=R,G,B,Z} (∇I_p)_i (∇I_p)_i^T ( 3.16) In this way both color and geometric structure are favored. Afterwards, the tensor is smoothed with a 2D Gaussian kernel G_σ = (1 / 2πσ²) exp(−(x² + y²) / 2σ²) to make it more robust against outliers, noise and local singularities: J_s = J ∗ G_σ ( 3.17) We then take its eigenvalues λ₁, λ₂ and eigenvectors v₁, v₂ and compute the data term, which is now a better, more robust representation of the local orientation of a patch: D(p) = α + (1 − α) exp( −C / (λ₁ − λ₂)² ) ( 3.18) where C is a positive constant and α ∈ [0,1]. The eigenvalues of the structure tensor quantify the amount of structure variation in the image or patch. Isotropic regions do not favor any direction (λ₁ ≈ λ₂) and the data term is low, while at parts containing strong structures (λ₁ ≫ λ₂) the data term is high. The eigenvectors define an oriented orthogonal basis: v₁ is the orientation with the highest fluctuations (orthogonal to the image contours) and v₂ gives the preferred local orientation. The eigenvector with the smallest eigenvalue indicates the isophote direction. Table 3.4 Introducing structure tensors in the data term of the priority: image from the Ballet sequence of the Microsoft 3D Video dataset after warping (from cam5 to cam4), and the inpainting result with [15]
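The tensor-based data term of Eqs. 3.16-3.18 can be sketched end to end: accumulate per-channel gradient outer products, smooth the tensor field with a separable Gaussian, take the eigenvalues of the 2×2 tensor in closed form, and map the eigenvalue gap to a priority. This is a NumPy illustration under our own parameter choices, not the implementation of [15] or [24]:

```python
import numpy as np

def tensor_data_term(channels, alpha=0.1, C=1.0, sigma=1.0):
    """Structure-tensor data term (Eqs. 3.16-3.18).

    channels : list of 2-D arrays (R, G, B and optionally the depth Z).
    Returns D(p) = alpha + (1 - alpha) * exp(-C / (l1 - l2)^2) per pixel,
    high where one orientation dominates (l1 >> l2), low in isotropic areas.
    """
    h, w = channels[0].shape
    # Sum over channels of the gradient outer products (Eq. 3.16)
    Jxx = np.zeros((h, w)); Jxy = np.zeros((h, w)); Jyy = np.zeros((h, w))
    for ch in channels:
        gy, gx = np.gradient(np.asarray(ch, float))
        Jxx += gx * gx; Jxy += gx * gy; Jyy += gy * gy
    # Gaussian smoothing of the tensor field (Eq. 3.17), separable kernel
    r = int(3 * sigma)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2 * sigma ** 2)); k /= k.sum()
    smooth = lambda A: np.apply_along_axis(
        lambda v: np.convolve(v, k, 'same'), 0,
        np.apply_along_axis(lambda v: np.convolve(v, k, 'same'), 1, A))
    Jxx, Jxy, Jyy = smooth(Jxx), smooth(Jxy), smooth(Jyy)
    # Closed-form eigenvalues of the 2x2 tensor: l1 >= l2
    tr, det = Jxx + Jyy, Jxx * Jyy - Jxy ** 2
    disc = np.sqrt(np.maximum(tr ** 2 / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    # Eq. 3.18: isotropic regions fall back to alpha, strong edges approach 1
    return alpha + (1 - alpha) * np.exp(-C / np.maximum((l1 - l2) ** 2, 1e-12))
```

On a flat region the term collapses to α everywhere, while along a strong edge (one dominant eigenvalue) it approaches 1, matching the behavior described above.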

A MORE ROBUST DATA TERM A more robust data term for the priority computation step is proposed in the work of Buyssens, Daisy et al. [14]. In Criminisi's algorithm the data term (Equation ( 3.8)) assumes that the gradient ∇I_p at the target point represents the maximum gradient in the non-masked part of the patch. However, this leads to a high data term not only at the exact location of the image contours that should be extended, but also for pixels of the fill front whose distance from the contours is less than or equal to the patch size. In addition, the main drawbacks of other proposed data terms are presented in [14]. Namely: they require extra parameters that need to be set manually, which can be impractical (especially in the context of multiview); they do not take into account the normal vector n_p to the fill front δΩ, which can attribute a high priority to patches whose gradient is tangent to δΩ (such as the ones in Le Meur et al. [24] and in Gautier et al. [15]); and they can be very expensive in computation and time, as is the case for sparse-based data terms. So the proposed data term of [14] is as follows: D(p) = ||G_p n_p|| with G_p = Σ_{q ∈ Ψ_p ∩ Φ} w_q ∇I_q ∇I_q^T ( 3.19) where w_q is a normalized 2D Gaussian centered on p. Table 3.5 The data term of Buyssens, Daisy et al.: local configuration, original D(p) as presented in [9], the proposed D(p) [source: [14]] The second proposal of [14] concerns the lookup strategy for the best-match candidates. They adapt the PatchMatch algorithm [11] (also used in Adobe Photoshop) to exemplar-based inpainting, instead of its original purpose of matching patches from one image to patches of another. Using a PatchMatch-based method in the lookup strategy results in better heuristics and faster convergence.
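The Gaussian-weighted tensor of Eq. 3.19 can be sketched per fill-front pixel. The sketch below is our own illustration in the spirit of that data term (accumulating the tensor only over known pixels and measuring it along the fill-front normal); it should not be read as the exact implementation of [14]:

```python
import numpy as np

def weighted_tensor_data_term(gx, gy, known, p, n_p, sigma=2.0):
    """Sketch of the data term of Eq. 3.19: accumulate the structure tensor
    G_p from the gradients of the *known* pixels around p, weighted by a 2-D
    Gaussian centred on p, then take the norm of G_p applied to the
    fill-front normal n_p."""
    h, w = gx.shape
    py, px = p
    r = int(3 * sigma)
    Jxx = Jxy = Jyy = 0.0
    for y in range(max(0, py - r), min(h, py + r + 1)):
        for x in range(max(0, px - r), min(w, px + r + 1)):
            if not known[y, x]:
                continue  # masked pixels contribute nothing
            wq = np.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
            Jxx += wq * gx[y, x] ** 2
            Jxy += wq * gx[y, x] * gy[y, x]
            Jyy += wq * gy[y, x] ** 2
    # D(p) = || G_p n_p ||
    nx, ny = n_p
    vx = Jxx * nx + Jxy * ny
    vy = Jxy * nx + Jyy * ny
    return float(np.hypot(vx, vy))
```

Because the tensor is weighted towards p and projected onto n_p, a contour that merely grazes the patch far from p, or one running tangent to the fill front, no longer inflates the priority.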

ADDITIONAL PROPOSALS Patch blending is another way to improve the visual quality of an inpainting process [14]. The point of patch blending becomes clear if we recall that exemplar-based inpainting methods simply copy chunks of an image and paste them elsewhere, which creates blocking artifacts. Other works introduce a prior of low visual saliency in the extrapolation process [25]: the best match candidate has both a small matching cost and low visual saliency, since the BG tends to have low saliency. Furthermore, there is the approach of [26], where depth and color images are jointly inpainted; after the best candidate is selected, the variance of the corresponding depth patch is copied to the target depth patch. Last, in [27] self-similarity is redefined as non-local recurrences of pixel patches within the same image across different scales. A segmentation process is applied to the depth image and self-similarity is computed within each depth layer. Then, in the patch-matching stage, suitable patch candidates are selected within the specified scale range. 3.4 PROBLEMS REMAINING IN THE LITERATURE All of these proposals have introduced and developed the concept of inpainting, and of inpainting for DIBR in particular. However, the existing proposals have certain limitations, briefly summarized below. All methods target artifacts of the BG without taking into consideration problems that might occur in the FG as well. They mostly rely on dense camera coverage, where only small disocclusion holes appear. In cases where they have to handle large disocclusion areas they fail to produce a plausible result. Existing methods are applied to the mask of the missing points. However, in many cases certain parts that are covered would benefit from inpainting, as they are not properly filled.
More specifically, when an object is projected to the virtual viewpoint it might suffer from self-occlusion holes. Moreover, we know that the displacement of BG pixels is larger than that of the FG points due to their different depth values. Consequently, a self-occlusion hole can be covered with points of the BG, and therefore not be identified as missing nor get inpainted. The present thesis attempts to address the above limitations by a) focusing on the foreground, and in particular on human silhouettes, which in many cases are the center of attention in a scene, b) restricting the area of interest, so that we can operate on sparser camera coverage and still produce plausible representations of human silhouettes, and c) directly addressing the problem of undetected self-occlusion holes with the help of skeletal data.

4 PROBLEM FORMULATION The novelty of the present thesis is the introduction of skeletal tracking data into the process of DIBR in order to improve the visual quality of human silhouettes in the foreground of a scene. This entails an efficient detection of self-occlusions, or of disoccluded parts of the silhouette in general, and a visually plausible method to complete them. In order to detect disocclusions that occur on the silhouette we exploit the information that a skeletal tracking algorithm provides. Skeletons allow us to divide the silhouette into smaller parts (bones) which can be addressed separately. They also provide the opportunity to detect disconnected parts of the silhouette. If such a disconnection (occlusion) exists, a suitable mask must be formed around it in order to get it filled. Subsequently, we test two approaches to fill in the masks: silhouette completion via inpainting (SC-inpainting) and silhouette completion with 1 plus view (SC-1plus-view). As will become clear in the later sections, dividing the image and analyzing one bone at a time facilitates both the mask formation process and the proposed inpainting method, SC-inpainting. Thus, in certain parts of this project the bones are analyzed separately, while in others the entire figure is edited as a whole. In the next sections of this chapter, we analyze the basic features of skeletal tracking and what we expect from the disconnection detection method, the mask formation method and the two approaches of silhouette completion (SC). 4.1 SKELETAL TRACKING The input of a skeletal tracking algorithm is a depth image I_t at time t. The output is a 3D skeleton θ_t, a vector of N values representing the body configuration of each person in the corresponding input image. A basic algorithm for skeletal tracking, which is also implemented in the Kinect sensor, is the one described in [28].

The input of a skeletal tracking algorithm is a single depth image. The output is a vector with the 3D coordinates (x, y plus the depth information) of specific parts of the silhouette called joints. Image 4.1 A tracking example with the Kinect v2 Two intermediate representations are used between I_t and θ_t. First, the body parts image C_t, which stores at every pixel a vector of N probabilities indicating the likelihood that the world point under that pixel belongs to each of the N standard body parts. Secondly, the vector of joint hypotheses J_t, which contains triples (body part, 3D position, confidence), with, say, 5 hypotheses per body part. Finally, the joint hypotheses are searched for kinematically consistent skeletons. [28] ADVANTAGES An important advantage of this method is that only the last stage of the pipeline uses information from previous frames, so the tracking is largely independent for each frame. In this way the system recovers quickly from tracking errors, a weakness of previous solutions. The characteristics of skeletal tracking that encouraged us to integrate it into the filling-in process are the following: It can recover the skeleton even if some parts of the body are occluded in the field of view, thanks to motion and kinematic constraints, as seen in Table 4.1. It is fully automatic: it detects that a human is present in the scene and configures the pose without intervention. It is real-time: no additional post-processing is required, nor is it time-consuming. It does not require pose initialization (Kinect v2). It works with more than one figure present in the scene, so we are not restricted to scenes with only one person. The Kinect v2 sensor works with up to 6 people in the field of view.

Image 4.2 Up to six people can be tracked with the skeletal tracking algorithm of the Kinect v2 (source: Become-The-Incredible-Hulk) The right upper arm and the shoulder are completely occluded; the right side of the man is only partly visible; part of the left leg is blocked by the right one. Table 4.1 Skeletal tracking can bypass occlusions OPTIMAL SKELETON Skeletons produced by the Kinect v2 indicate the probability with which each joint is predicted. When multiple cameras film a scene, we can exploit the fact that some cameras have a better view of certain areas than others. This means that if, at time t, the skeleton is predicted more accurately from one camera than from any other, then this skeletal information is used for all viewpoints. In this project we therefore assume that we have an accurate skeleton even if there are occlusions in the color and depth reference images.

Table 4.2 Selection of the optimal skeleton from all available views: a 3D scene filmed from 3 different viewpoints; the skeletons produced by each camera; the corresponding confidence values (confidence 1, 2, 3); the choice of the skeleton with the highest confidence.
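The selection illustrated in Table 4.2 reduces to an argmax over per-camera confidences. A minimal sketch follows, assuming each camera's tracker reports a single scalar confidence per frame (an aggregation over joints that the text does not specify; the function name is ours):

```python
import numpy as np

def best_skeleton(skeletons, confidences):
    """Pick, per frame, the skeleton predicted with the highest confidence
    among all cameras; this skeleton is then used for every viewpoint."""
    i = int(np.argmax(confidences))
    return skeletons[i], i
```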

USING SKELETONS IN PRACTICE The skeleton data come in the form of a vector that contains the x, y and depth coordinates of each key point of the skeleton, called joints. The order of the joints is fixed: 1: head; 2: neck; 3, 4: right and left shoulder; 5, 6: right and left elbow; 7, 8: right and left wrist; 9, 10: right and left hand; 11: waist; 12: basin; 13, 14: right and left hip; 15, 16: right and left knee; 17, 18: right and left ankle; 19, 20: right and left foot. Table 4.3 The joints of the skeleton, numbered (1, 2, 11, 12: in the middle; 3, 5, 7, 9, 13, 15, 17, 19: right side; 4, 6, 8, 10, 14, 16, 18, 20: left side) This structure of the skeleton is consistent with the skeletal data we get from the Kinect tracking algorithm. The skeletal data are known in the original (reference) viewpoint; in order to acquire them for the virtual viewpoint as well, all we have to do is calculate the disparity of each point (joint) on the 2D plane using the depth coordinate: d = f (t_{x,v} − t_{x,r}) / z + (o_{x,v} − o_{x,r}) = f l / z + du ( 4.1) In the course of this project it is more convenient to refer to bones instead of joints. So, after loading the joint data, we use the known connections between the joints to form a matrix called Bones. Bones is a 19x6 matrix, each row of which represents one of the 19 bones of the skeleton; each bone is defined by the coordinates of its two joints. So for each bone i we have: Bones(i, :) = [x_{i1} y_{i1} z_{i1} x_{i2} y_{i2} z_{i2}] ( 4.2)
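Building the Bones matrix of Eq. 4.2 and warping a joint with the disparity of Eq. 4.1 can be sketched as follows. This is a NumPy illustration (the thesis itself works with matrices in the same spirit); the exact connectivity pairs are our assumption, derived from the joint numbering of Table 4.3 and converted to 0-based indices:

```python
import numpy as np

# Assumed skeleton connectivity (0-based indices into the 20-joint vector,
# following the numbering of Table 4.3): head-neck, neck-shoulders, arms,
# spine (neck-waist-basin), basin-hips, legs.
BONE_PAIRS = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8),
              (7, 9), (1, 10), (10, 11), (11, 12), (11, 13), (12, 14), (13, 15),
              (14, 16), (15, 17), (16, 18), (17, 19)]

def joints_to_bones(joints):
    """Build the 19x6 Bones matrix of Eq. 4.2 from a 20x3 joints array:
    each row concatenates the (x, y, z) of the bone's two joints."""
    joints = np.asarray(joints, dtype=float)
    return np.array([np.concatenate([joints[i], joints[j]]) for i, j in BONE_PAIRS])

def joint_disparity(z, f, baseline, du):
    """Horizontal disparity of a joint (Eq. 4.1): d = f * l / z + du, where
    l is the horizontal translation between the cameras and du the principal
    point offset."""
    return f * baseline / z + du
```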

4.2 DISOCCLUSION DETECTION AND MASK FORMATION First, we compute our projection to the virtual viewpoint. The input consists of the color and depth images and the skeletal data in the reference viewpoint, along with the camera parameters of the two cameras (reference and target viewpoint). Table 4.4 Projection to the virtual viewpoint: camera parameters of the reference and target viewpoint; joints_ref (20x6); joints_virt (20x6) In the initial step of the procedure, we would like to use the skeletal data to determine the locations of the disoccluded parts on the figure, if any. This is not as simple as in other inpainting algorithms. Existing inpainting algorithms either specify the mask manually (object removal algorithms [9] and structure propagation algorithms, Sun et al. [12]) or assume that the holes left after the projection to the virtual view constitute the missing region Ω ([10] and [15]), as seen in Table 4.5.

Table 4.5 The missing region is either manually specified or indicated by the missing pixels themselves [source of first 4 images: [14]] There are cases, however, where there is no clear indication as to where the problem in the image lies. This is because, as Equation ( 2.8) indicates, points with a low depth value, as the BG points have, will have a larger displacement in the virtual image plane than those with a high depth value, like the FG points. Thus, self-occlusion holes may be covered by the BG, especially when there is a large change of viewpoint. So we need to detect them. Image 4.3 The problem of the undetected disocclusions on the figure

Keeping in mind that we have the depth information and the skeletal information available, it is possible to detect whether there are parts of the skeleton that are located on a part of the image that corresponds to the BG. This key observation is the basis of the proposed self-occlusion detection method: it seeks parts of the skeleton that are exposed, in other words that lie outside the figure. Now, if we detect that the skeleton is exposed, a mask must be formed around it to serve as the missing region Ω. The size of the mask must be calculated automatically. It is observed, however, that a long bone requires a large hole to be formed around it, while for a shorter bone a smaller one suffices; there is thus a relationship between bone length and hole size. This is why it was decided to form a mask for each bone separately, i.e. for each bone we work on a reduced region of the source image, a window. The full mask then consists of the collage of all the smaller masks. So far we have analyzed the problem of the detection of occlusions in the foreground. In the following sections of this chapter we formulate the problem that the proposed algorithms are called to resolve. 4.3 FILLING IN BY SC-INPAINTING THE INPAINTING PROBLEM An image to be inpainted can be defined mathematically as a (two-variable) function I : D → ℝ³ (color image), where D denotes the image domain. Ω is the masked part of the image (the part to be inpainted) and δΩ is the boundary of the mask, also known as the fill front. The source part of the image is denoted Φ, with Φ = D − Ω. A patch Ψ_p centered on a pixel p located on the fill front contains pixels from both Φ and Ω. Image 4.4 Notation diagram [source: [9]] The objective of exemplar-based algorithms is to find which patch should have priority to be filled in, and then to find its optimal match among the existing patches.
This is repeated for all patches in Ω until the image is filled, ideally in such a way that it looks visually coherent and physically plausible.

In most of the existing algorithms in the literature, priority computation favors patches that lie on the continuation of sharp image edges; strong image structures get propagated in this way. In addition, the patch-matching process selects candidates that are similar to the existing information in the patch Ψ_p. Let it be noted that the algorithm of Criminisi et al. [9], a milestone in the field of inpainting, was created for texture synthesis in a context of object removal. Since then it has also come to be used for the disocclusions in DIBR. If an object of the FG is removed and reveals a BG with a certain texture, this texture is propagated into the mask and naturally completes the image. In natural images, however, we face two problems: a) we do not always have clear edges to propagate, and b) we cannot always assume that the existing orientation vector of a patch is the direction in which we want our structure to propagate. The present study makes use of the Microsoft Research 3D Video Dataset [29], originally created for the work of Zitnick et al. [30]. The following images are examples of synthesized views of the Ballet sequence of Microsoft's dataset. In the first image the virtual viewpoint is to the left of the reference viewpoint, in the second to its right. Table 4.6 Examples of two projections: virtual view from cam2 to cam5, and virtual view from cam5 to cam2 In the first one (towards-left camera movement) we see that the leg of the girl is disoccluded, while in the second one (towards-right camera movement) it is her arm. We know the natural way in which her limbs and torso continue, but the computer does not. The edges in the existing parts of her limbs point not only in the right direction but also in other directions. And even those edges that do have the right direction are not necessarily as clear and sharp as we would like them to be.

Table 4.7 Comparison of existing methods, wide-baseline transition: inpainting of [9], inpainting of [10], inpainting of [15] Therefore, with the application of conventional inpainting methods we get the results of Table 4.7. Even when the algorithm is depth-aware, the human figure is not adequately completed, since the areas where the arm and torso should be propagated are covered by background pixels; the inpainting algorithm is not even applied there. In small-baseline transitions we can get better results, in the sense that the projection holes are located at a more convenient place, i.e. where we want to see the figure continued. So certain advanced methods like [15] can give good results, but other methods still produce unnatural images. [9] [10] Image 4.5 Smaller-baseline transition: virtual view from cam5 to cam4

Table 4.8 Comparison of existing methods, small-baseline transition: inpainting of [9], inpainting of [10], inpainting of [15] In this project we investigate whether the additional information of the 2D-plus-depth coordinates of some key points of the human silhouette, i.e. the joints, can help us perform a better inpainting job. INPAINTING STEP BY STEP The basic steps of inpainting algorithms in the context of DIBR are the following: 1. Determine the source region Φ and the target region Ω. 2. Compute the priority of each patch centered on a pixel p of the fill front: P(p) = C(p) · D(p) or P(p) = C(p) · D(p) · L(p) ( 4.3) 3. Select the target patch Ψ_p̂ to be filled in the current iteration: p̂ = argmax_{p ∈ δΩ} P(p) ( 4.4)

4. Compute the similarity between Ψ_p̂ and all patches in Φ, commonly by the sum of squared differences or a variation of it: d(Ψ_p̂, Ψ_q) = Σ_{i∈[1,k]} Σ_{j∈[1,k]} (Ψ_p̂(i,j) − Ψ_q(i,j))², Ψ_q ∈ Φ ( 4.5) where Φ denotes the whole of the source region and k the size of the (k×k) patch. 5. Select the best candidate: q̂ = argmin_{Ψ_q ∈ Φ} d(Ψ_p̂, Ψ_q) ( 4.6) 6. Replace the missing areas of Ψ_p̂ with the corresponding ones of Ψ_q̂. 7. Replace the confidence values of the missing areas of Ψ_p̂: C(p) = C(p̂), ∀p ∈ Ψ_p̂ ∩ Ω ( 4.7) 8. Repeat until Ω = ∅. DESIRED OUTCOME What we demand from our inpainting algorithm is to incorporate the direction of the skeleton into the patch prioritization process, so that limbs propagate in the right direction. We propose the following alterations to the steps of the inpainting algorithm. For the priority computation stage (2nd step) we can use the orientation of the bones, n_bone, to guide the inpainting, in a way similar to the orthogonal-to-δΩ vector n_p of most other algorithms. However, instead of favoring structures perpendicular to the fill front, we favor those that are parallel to the bone. The need to know the orientation of the bone leads us again to the one-bone-at-a-time approach used in the formation of the mask. In this way, a different search space is also defined for each bone. This is beneficial for the inpainting too, in the sense that there is no need to look very far from a part of the figure to find a suitable patch: the best candidates to fill in a missing part of a leg will be located in the region adjacent to that leg. An additional benefit is that the matching process requires less time. In fact, the speed improvement is huge: it takes at least 5 minutes to run the algorithm with the entire image as the search space and only ~30 seconds when using those windows.
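The exhaustive search of steps 4-5 (Eqs. 4.5-4.6) can be sketched directly; shrinking the source image to a per-bone window is exactly what produces the speed improvement just mentioned. A minimal NumPy sketch for a single-channel image (function and variable names are ours, not the thesis code):

```python
import numpy as np

def best_match(target, target_known, source, k):
    """Steps 4-5: scan every kxk patch of the source image and return the
    top-left corner of the one minimising the SSD over the *known* pixels
    of the target patch (Eq. 4.5), together with that minimal cost."""
    h, w = source.shape
    best, best_cost = None, np.inf
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            cand = source[y:y + k, x:x + k]
            cost = ((target - cand)[target_known] ** 2).sum()
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best, best_cost
```

The cost of this scan grows with the area of `source`, which is why restricting the search to a window around the bone pays off so heavily.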

Thus, by inpainting one bone at a time we achieve: a) the extraction of the bone orientation so that it can be used as a guide, b) the elimination of unlikely patch candidates, and c) a speed-up of the patch-matching process. These proposals are explained in detail in the section on silhouette completion via inpainting, section 5.3. We should point out that this inpainting algorithm can only be applied in the presence of a human figure in the scene, and only within the mask that it has itself created. For all other holes created by the projection to the virtual viewpoint a general inpainting algorithm should be used, since it makes no sense to favor structures parallel to the skeleton anywhere other than close to the skeleton itself. It is complementary to a general inpainting method: silhouette completion via inpainting handles disocclusions on the human figures of the FG, while general inpainting handles the disocclusion holes of the BG. The question that arises is which should be applied first, the SC-inpainting or the general inpainting algorithm. If SC-inpainting is performed first, the risk is that parts of the holes created by the projection become part of the inpainted area. If general inpainting is applied first, the risk is that some falsely reconstructed pixels become part of the known region Φ when SC-inpainting is then applied, so mistakes might be propagated. The suggested compromise to the problem of the ordering of the inpainting processes is to perform the silhouette and the general inpainting independently, producing I_SCinp and I_gen respectively. Afterwards, the areas that are marked as missing in the image produced by SC-inpainting, I_SCinp, get filled in by the corresponding parts of I_gen.
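The suggested compromise (run both inpaintings independently, then fill what is still missing in I_SCinp from I_gen) is a simple masked merge. A NumPy sketch for illustration, with names of our choosing:

```python
import numpy as np

def merge_completions(i_scinp, i_gen, still_missing):
    """Run SC-inpainting and general inpainting independently, then fill the
    areas still marked missing in I_SCinp with the corresponding pixels of
    I_gen. `still_missing` is a 2-D boolean mask over the image plane."""
    out = i_scinp.copy()           # do not mutate the SC-inpainted image
    out[still_missing] = i_gen[still_missing]
    return out
```

Because the two inpaintings never feed each other's output, neither can propagate the other's reconstruction errors, which is exactly the point of the compromise.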
4.4 FILLING IN WITH 1PLUS VIEW In the approach of silhouette completion with 1plus view, the mask (formed as described in section 5.2) is filled in not via inpainting but with the corresponding information from another camera. The selection of the camera that contains the missing information is obvious: we must select the camera closest to the virtual viewpoint that is located on the opposite side of the reference camera. The important task is to locate the points needed to complete the figure; the skeleton is the tool that assists in the selection of the right points. The difference from the two-view approach of view synthesis is that SC-1plus-view only makes use of (and transmits) certain parts of the second camera view instead of the entire view. This can be useful in applications where the BG a) is unimportant, b) is uniform, or c) will be replaced by something else, such as a synthesized image (for special effects). Otherwise, the disocclusions in the BG can of course be filled in by a general inpainting algorithm to achieve the completion of the entire image.

In this chapter we have presented a comprehensive description of the problems addressed in the present thesis. In the next chapter we analyze in detail the solutions proposed to resolve them. 4.5 GENERAL OVERVIEW Here is an algorithmic overview of the silhouette completion. The filling-in of the mask can be performed with either SC-inpainting or SC-1plus-view. Load Iref, Idep, joints_ref Project to virtual view, get Iv, IvD, joints ¹ For each bone: o Define the source region window Iv_small o Detect exposed parts. If Exposed_Skeleton = ∅, continue o Form the mask Ihole_small o Fill in the mask, get I_SC_small o Transfer to the full image I_SC End for ¹ This part of the code is an adaptation of a kind contribution of Ms. Ana de Abreu.

5 PROPOSED SOLUTION In this chapter, the four key processes of the proposed solution are analyzed in detail: the detection of disocclusions that occur on the figures, the mask formation, the silhouette completion via inpainting and the silhouette completion with 1plus view. 5.1 DETECTION OF DISOCCLUSIONS First, we create the straight line that connects the two joints of a bone (binary image Ibone). Next, we detect if and where a problem exists. The availability of depth information is a very useful tool at this stage: depth enables us to decide whether a pixel belongs to the FG or the BG. The simplest way to classify a point as BG or FG is by thresholding; we denote the depth threshold value by Dthr. After recreating the skeleton from the joints and the known connectivities between them, we can scan each point and see whether it is part of the BG. If indeed a point of the skeleton is at a location in the image where the depth indicates the BG, we conclude that it is exposed. (By depth we mean, of course, the depth map of the image and not the 3rd coordinate of the joint.) ∀(x_s, y_s) ∈ Skeleton: if Idep(x_s, y_s) < Dthr then (x_s, y_s) ∈ Exposed_Skeleton ( 5.1) Table 5.1 Detection of exposed parts of the skeleton: the blue dot is located at a point whose depth value equals 119; the red dot's corresponding value in the depth map is 50; the exposed parts of the skeleton
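The two operations of this subsection, rasterising the bone between its two joints (the binary image Ibone) and testing each skeleton pixel against Dthr (Eq. 5.1), can be sketched as follows. This is a NumPy illustration with names of our choosing; the automatic choice of Dthr is described next in the text, so it is left as a parameter here:

```python
import numpy as np

def bone_pixels(j1, j2, n=None):
    """Rasterise the straight segment between two joints (Ibone) by sampling
    n points along the line and rounding to pixel coordinates."""
    (y1, x1), (y2, x2) = j1, j2
    if n is None:
        n = int(max(abs(y2 - y1), abs(x2 - x1))) + 1
    ys = np.linspace(y1, y2, n).round().astype(int)
    xs = np.linspace(x1, x2, n).round().astype(int)
    # De-duplicate while preserving order along the bone
    return list(dict.fromkeys(zip(ys.tolist(), xs.tolist())))

def exposed_skeleton(depth_map, skeleton_pixels, dthr):
    """Eq. 5.1: flag skeleton pixels whose depth-map value falls below Dthr
    (i.e. that lie on the background) as exposed."""
    return [(y, x) for (y, x) in skeleton_pixels if depth_map[y, x] < dthr]
```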

A threshold value Dthr for separating BG from FG is chosen automatically. We do not even need to apply a sophisticated FG-BG segmentation algorithm once we have the skeleton data. Given the 3D coordinates of the joints, we can see which levels of depth correspond to the figure. We allow a certain margin and then set the threshold for the FG-BG segmentation: we look at the depth values of all the joints of the skeleton, find the smallest one and subtract a margin value from it. The result of the subtraction is Dthr.

min_depth = min(joints_z)    (5.2)

Dthr = min_depth − margin    (5.3)

Surely, there are more sophisticated methods for BG-FG separation, like [31], but for our purposes this simple algorithm suffices. Besides, most existing BG-FG separation algorithms do not take the information of the tracking algorithm into account and are therefore needlessly complicated. Then, each pixel of the bone that is located on a point with depth higher than the threshold Dthr is known to still be in the foreground (i.e. in the body). The remaining bone pixels are located on points corresponding to the BG, so they are exposed. Iskel_outofbody is a binary image that indicates the parts of the skeleton that are exposed, as seen in Table 5.2 of the next section.

5.2 CREATION OF THE MASK

If we detect that the skeleton is exposed, then we create a hole around the exposed part of the bone with the help of mathematical morphology, in particular the dilation operation. The size of the hole must be calculated automatically. It is observed that a long bone requires a large hole to be formed around it, while for a shorter bone a smaller one suffices. Thus, there is a relationship between bone length and hole size.
This is why it was decided to form a mask for each bone separately; the full mask then consists of the collage of the smaller masks. So for each bone we work on a reduced region of the source image, a window. The structuring element used for the dilation is a disk, and the radius of the disk should change according to the length of each bone (the entire bone, not only the exposed part). From tests performed in the pilot stage of this thesis it was observed that a simple linear relationship between bone length and radius is not the optimal choice; something that grows more smoothly is closer to what we want. So the logarithm with base 2 is used:

disk_radius = log₂((length_of_bone + 1) · 3)    (5.4)

However, we do not want to inpaint on points of the image that belong to the foreground. So after the dilation, we check the depth value at each point in the created mask. All the points that have Depth > Dthr are excluded from the mask, since they are considered to belong to the FG. So the dilation is performed and, after the exclusion of FG points, the mask is produced.

[Figure: the complete bone; the exposed part of the bone; the final mask built for this bone.]

Table 5.2 Mask formation step by step

To sum up, we have a mask that covers some BG points and is located around the exposed bone.

Image 5.1 The produced mask
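The mask creation described above (dilation of the exposed bone with a disk whose radius follows Eq. (5.4), then removal of FG points) can be sketched in NumPy. This is an assumption-laden sketch, not the thesis Matlab code: the dilation is implemented by shifting so the snippet stays dependency-free, and the rounding of the radius is our choice.

```python
import numpy as np

def disk(radius):
    """Disk-shaped structuring element."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius

def dilate(mask, selem):
    """Binary dilation by OR-ing shifted copies of the mask."""
    r = selem.shape[0] // 2
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    for dy in range(selem.shape[0]):
        for dx in range(selem.shape[1]):
            if selem[dy, dx]:
                out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def bone_mask(exposed, depth, d_thr, bone_length):
    """Dilate the exposed bone pixels (Eq. 5.4 radius) and drop FG points."""
    radius = int(round(np.log2((bone_length + 1) * 3)))
    hole = dilate(exposed, disk(radius))
    hole[depth > d_thr] = False   # keep only BG points in the mask
    return hole
```

In a real pipeline a library routine such as a morphological dilation from an image-processing toolbox would replace the hand-rolled loop.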

5.3 SILHOUETTE COMPLETION VIA INPAINTING

PRIORITY COMPUTATION AND ORIENTATION TERM

Our highest priority in this project is to propagate the body parts in the right direction. As has already been demonstrated in the literature, the filling order is critical for the quality of the output. Therefore, structures parallel to the skeleton should be prioritized over all others. A way to express this mathematically is inspired by the data term of [9]. We first get the patch gradient ∇I_p and then the gradient normal ∇I_p^⊥ by rotating by 90°. The gradient normal ∇I_p^⊥ indicates the direction of the isophote in the patch [9]. Since we want to know if it is in the same direction as the bone, we compute the magnitude of the inner product of the isophote direction ∇I_p^⊥ and the bone direction n_bone:

O(p) = |∇I_p^⊥ · n_bone| + α    (5.5)

where

n_bone = (joint1_x − joint2_x, joint1_y − joint2_y)    (5.6)

and α is a small positive constant (here 0.001). If an exact analogy with the data term defined in [9] had been used (Eq. (3.8)), then in the case where the two vectors are perpendicular the priority of the patch would be reduced to zero. With the constant α, however, such patches still have a chance of getting prioritized if they have a high confidence value. In fact, in the Matlab implementation of the algorithm of [9] the data term is also defined as D(p) = |∇I_p^⊥ · n_p| + α for the same reason. Now the resulting priority could be one of two things:

P(p) = C(p) · D(p) · O(p)  or  P(p) = C(p) · O(p)    (5.7)

Does it make sense to keep both the data and the orientation term? The orientation term has been developed to favor patches that have an edge parallel to the bone (n_bone). The data term, on the other hand, favors patches with a strong edge perpendicular to the contour of the mask (n_p, the normal to δΩ).
In this project the contour of the hole is of lower importance than the direction of the bone, which actually indicates the correct direction in which the body should continue. So the second option is chosen, but both are examined in a later section to see whether the presence of the data term makes any difference. In this way we favor patches with an orientation parallel to the orientation of the bone, so that they get inpainted first. Consequently, this is a way to preserve a certain structure through our inpainting and to attribute the structure-propagation characteristic to the algorithm.
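The orientation term of Eq. (5.5) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the thesis code is in Matlab, and picking the strongest gradient in the known part of the patch as the patch isophote mirrors common Criminisi-style implementations rather than a detail confirmed by the text.

```python
import numpy as np

def orientation_term(patch_gray, fill_mask, n_bone, alpha=1e-3):
    """O(p) = |isophote · n_bone| + alpha  (Eq. 5.5).

    patch_gray : 2D grayscale patch centred on p
    fill_mask  : True where the patch is still unknown
    n_bone     : bone direction vector (Eq. 5.6), need not be unit length
    """
    gy, gx = np.gradient(patch_gray)
    # Ignore gradients at still-unknown pixels
    gy = np.where(fill_mask, 0.0, gy)
    gx = np.where(fill_mask, 0.0, gx)
    # Strongest gradient in the known part of the patch
    k = np.argmax(gx ** 2 + gy ** 2)
    gxm, gym = gx.flat[k], gy.flat[k]
    # Isophote = gradient rotated by 90 degrees
    iso = np.array([-gym, gxm])
    nb = np.asarray(n_bone, dtype=float)
    nb /= np.linalg.norm(nb) + 1e-12
    return abs(iso @ nb) + alpha
```

The priority of a patch centred at p is then P(p) = C(p) · O(p), with C(p) the usual confidence term.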

FILLING THE ENTIRE IMAGE

Afterwards, all the inpainted parts take their place on the full image, Table 5.3.

Table 5.3 Filling the entire image

Below is an algorithm overview in pseudocode:

Load Iref, Idep, joints_ref
Project to virtual view, get Iv, IvD, joints (2)
For each bone:
    o Define source region window Iv_small
    o Detect exposed parts. If Exposed_Skeleton = ∅, continue
    o Form mask Ihole_small
    o Apply SC-inpainting, get I_SC_small
    o Transfer to full image I_SC
End for

The stages of SC-inpainting follow the steps of the inpainting method of Criminisi described earlier, but with a modified priority computation stage.

AN EXPERIMENT

An experiment was performed in the simple case of one bone instead of the entire skeleton, in order to determine two things: a) whether the algorithm can indeed claim the attribute of structure propagation, and b) whether keeping the data term of Equation (3.8) has any effect on the priority after the introduction of the orientation term.

(2) This part of the code is an adaptation of a kind contribution of Ms. Ana de Abreu.

The mask of Image 5.1 was filled first with the proposed method and then with Criminisi's:

[Figure: the reference color image projected in the virtual view; ground truth; proposed method; method of [9].]

Table 5.4 Inpainting the produced mask with SC-inpainting and general inpainting

It is worth noting that if we just ran Criminisi's algorithm (or any other) without setting up the missing region as we did, the arm would not be continued but rather left exactly as it is in the projected image. The reason why the two algorithms were tested on the produced masked region is to investigate whether the orientation term is what actually helps the body to connect, or whether just developing a mask and running an existing algorithm would work too. However, we see that just defining a better masked region is not enough: on the same mask, one algorithm has managed to connect the arm with the body while the other has not. This enables the algorithm to claim the attribute of structure propagation. We must conclude that the combination of finding the region in which to inpaint and biasing the algorithm towards a specific direction is what makes the difference.

Next we perform the same test with the inclusion of the data term in the priority computation. The result can be observed in Table 5.5.

Table 5.5 Inclusion of the term of (3.8) in the priority computation

Again we get a satisfactory result, but it is clear that the inclusion of the extra term does not offer anything significant. So from now on the priority will only have two terms:

P(p) = C(p) · O(p)    (5.8)

The patch size used in these experiments is 17x17. The above was only a simplistic experiment; in the complete method the arm has two bones, upper and lower arm, as we have explained above. In the next chapter the complete skeleton is put to the test.

AN ALTERNATIVE (DEPTH ESTIMATION)

So far the proposed inpainting algorithm was biased in the priority computation step in order to follow the directions of the skeleton and complete the figure. Next we decided to exploit the skeletal information a bit further and get an estimate of the depth map in the exposed parts. This was inspired by [10], where the depth is used to improve the original algorithm of [9]. In this way images might be completed in a way that is closer to the ground truth. The idea is that missing patches should be more likely to be filled in by patches of similar depth than simply by patches that happen to appear close in the image plane. In the work of [10] simple linear interpolation is used to get an estimate of the depth image in the missing regions. Here, given the knowledge of the pose, it is possible to get an estimate that is more accurate than a simple linear interpolation.

Since the direction of each bone is known, we could try linear interpolation in the region Ω ALONG this direction. Ideally, if we have an exposed bone which is surrounded by parts of the FG on both ends, then the depth on the missing region would resemble the depth at the first end of the bone and smoothly turn into something like the depth at the other end as we move along the bone. However, the missing region is not always surrounded by the foreground. For example, if we have two consecutive bones and the missing region spreads over both of them, we would end up with something that fades into the depth of the BG at the intersection of the two bones, whereas we would like something that fades into the depth of the existing end of the second bone. So instead of working with straight lines along the bones, we propose the following method to estimate the depth. We start with the two joints, (x_joint1, y_joint1, z_joint1) and (x_joint2, y_joint2, z_joint2), which always have known depth values z_joint1 = depth_1 and z_joint2 = depth_2, either from the skeletal tracking method or from the depth image itself. We connect them with a straight line, the bone. Each point of the bone gets the weighted mean of the depths of the two joints as its depth value:

DEPTH_estimated(i_bone, j_bone) = (dist_1 · depth_1 + dist_2 · depth_2) / (dist_1 + dist_2)

where dist_1 and dist_2 are the distances of the point from the two joints:

∀p(x, y) ∈ bone: dist_{1,2} = sqrt((x − x_joint1,2)² + (y − y_joint1,2)²)

Then for all other points (i, j) in the mask we call another function which finds the point p_bone on the bone closest to the point in question (i, j) and the corresponding minimum distance d_min. Basically, we want the depth value of (i, j) to be high if (i, j) is close to the bone and low if it is far from it. If d_min = 0 then the point is on the bone and Depth(i, j) = Depth(p_bone).
As d_min increases, the depth value drops (according to a Gaussian bell curve) until we reach the edge of the mask, where the depth is again known, Depth(p_outofmask):

Depth(i, j) = (Depth(p_bone) − Depth(p_outofmask)) · e^(−d_min³ / (2σ³)) + Depth(p_outofmask)

The parameter σ represents the dispersion that we want for our estimation. If we want high depth values only for points very close to the bone, then we need a low σ. If we want the depth to stay high for points further away from the bone, then we need a high σ. So the value of σ should depend on the size of the bone: for longer bones, like the torso and the upper or lower legs, we need a higher σ than for the arms or the feet, for example. In our code the dispersion σ is defined as:

σ = log₂(2 · bone_length + 1)

where

bone_length = sqrt((x_joint1 − x_joint2)² + (y_joint1 − y_joint2)² + (z_joint1 − z_joint2)²)

The bell curve that we use is actually a super-Gaussian function with N = 3. The common Gaussian function (N = 2) is not as steep as those with higher values of N, and the steepness increases as N increases. Super-Gaussian functions:

g(x) = A · exp(−x^N / (2σ^N))

Image 5.2 Super-Gaussian functions

Let us illustrate this process better with an example (cam0 to cam3, frame 21).

[Figure: original depth image; the mask; the bone; estimated depth image.]
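The depth-estimation procedure described above (weighted joint depths along the bone, then a super-Gaussian falloff with N = 3 towards the known border depth) can be sketched in NumPy. This is a simplified illustration, not the thesis Matlab code: the border depth is taken as a scalar and the nearest bone point is found by brute force.

```python
import numpy as np

def estimate_depth(mask, bone_pts, depth1, depth2, depth_out):
    """Bone-guided depth estimate inside the mask.

    mask      : boolean array of the hole
    bone_pts  : (K, 2) array of (row, col) bone pixels; endpoints are joints
    depth1/2  : depths of the two joints
    depth_out : known depth at the mask border (scalar here for simplicity)
    """
    j1, j2 = bone_pts[0].astype(float), bone_pts[-1].astype(float)
    # Depth along the bone: weighted mean of the joint depths
    d1 = np.linalg.norm(bone_pts - j1, axis=1)
    d2 = np.linalg.norm(bone_pts - j2, axis=1)
    bone_depth = (d1 * depth1 + d2 * depth2) / (d1 + d2 + 1e-12)

    bone_len = np.linalg.norm(j2 - j1)
    sigma = np.log2(2 * bone_len + 1)

    est = np.zeros(mask.shape)
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        # Closest bone point and its distance
        d = np.linalg.norm(bone_pts - np.array([y, x]), axis=1)
        k = np.argmin(d)
        # Super-Gaussian (N = 3) falloff towards the known border depth
        w = np.exp(-d[k] ** 3 / (2 * sigma ** 3))
        est[y, x] = (bone_depth[k] - depth_out) * w + depth_out
    return est
```

Points on the bone keep the interpolated bone depth (w = 1), while points far from it fade towards Depth(p_outofmask).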

Table 5.6 Depth estimation step by step, example with one bone

[Figure: the projected depth image illustrating the mask, the skeleton and the locations of the joints of the man; the estimated depth map.]

Table 5.7 Depth estimation for the entire figure

It is true that in this way a new parameter has been added that affects the resulting depth image: the width of the estimated limb depends solely on the dispersion σ. The algorithm is modified as follows. After the detection function (which is the same as in the first version), we get an estimate of the depth in the region specified by the mask. Then we use the color image and the estimated depth image as input to our new inpainting function. The difference of this function from the one in version 1 is that it takes the depth into account in the similarity computation and patch matching stage. As a small reminder, in the original patch matching stage we want to find the patch of the known region Φ that is most similar to the known part of the current to-be-filled-in patch Ψ_p. So we calculate a distance measure that is the sum of the squared per-pixel differences over all three channels (RGB image), and Ψ_p gets replaced by the patch Ψ_q̂ that has the minimum distance.

Ψ_q̂ = arg min_{Ψ_q ∈ Φ} d(Ψ_p, Ψ_q)

where, assuming a patch is k×k:

d(a, b) = Σ_{k∈{R,G,B}} Σ_{i∈[1,k]} Σ_{j∈[1,k]} (a(i, j) − b(i, j))²

Now the error takes depth into account by incorporating it in the distance metric, as in [15]:

Ψ_q̂ = arg min_{Ψ_q ∈ Φ} d(Ψ_p, Ψ_q)

with

d(a, b) = Σ_{k∈{R,G,B,Z}} Σ_{i∈[1,k]} Σ_{j∈[1,k]} a_k · (a(i, j) − b(i, j))²

and a_k = 1 for k = R, G, B and a_k = 3 for k = Z, in order to give the depth channel the same importance as the color channels.

Overview of the algorithm for each bone:

Define source region window Iv_small
Detect exposed parts. If Exposed_Skeleton = ∅, continue
Form mask Ihole_small
Estimate depth IvD_small
Apply SC-inpainting (v.2), get I_SC_small
Transfer to full image I_SC

This alternative depth-aided version of SC-inpainting is the same as the one described earlier, except that it has a modified patch matching stage in which the depth is taken into account.
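The depth-aware patch distance can be sketched as follows. This is an illustrative NumPy version (the thesis code is in Matlab); patches are assumed to be stacked as k×k×4 arrays with the depth as a fourth channel, and only the already-known pixels of the target patch contribute to the sum.

```python
import numpy as np

def patch_distance(a, b, known, depth_weight=3.0):
    """Depth-aware SSD between two k x k x 4 patches (R, G, B, Z).

    a, b  : arrays of shape (k, k, 4), last channel is depth
    known : (k, k) boolean mask of the already-filled pixels of the target
    """
    w = np.array([1.0, 1.0, 1.0, depth_weight])   # the a_k weights
    diff2 = (a - b) ** 2 * w                       # weight each channel
    return float(diff2[known].sum())

def best_match(target, known, candidates):
    """Pick the candidate patch with minimum depth-aware distance."""
    dists = [patch_distance(target, c, known) for c in candidates]
    return int(np.argmin(dists))
```

With depth_weight = 3 the single depth channel carries the same total weight as the three color channels, matching the choice of a_k above.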

5.4 SILHOUETTE COMPLETION 1PLUS-VIEW

The question that arises is how we would know which points of the second, auxiliary image to transmit. The skeleton is again the tool we use. We already have a way to detect discontinuities of the figure and form a mask over them, so all there is to do is make sure that the appropriate information is extracted from the auxiliary camera. In this case, after the mask formation, separate work on each bone is not necessary: all the points in the produced hole are processed together. In the beginning, we project the reference view into the virtual view (getting Iv and IvD) and load the skeleton as before. Then, we must select the most appropriate camera (or cameras) of all the rest. Here we only work with one auxiliary camera, because the cameras that captured our dataset were located at the same height and at more or less similar distances from the scene. The camera that is selected is the one that is closest to the virtual view but on the opposite side of the reference one. After projecting the auxiliary camera onto the virtual view as well (getting Iv_aux and IvD_aux), we create the mask for the silhouette as described for SC-inpainting. Next, the valid points of the mask are detected, meaning the points on the virtual view plane that correspond to the FG. Those points form the image complement which fills (completes) the image Iv. An overview of the algorithm in pseudocode is as follows:

For each bone:
    Define source region window Iv_small
    Detect exposed parts. If Exposed_Skeleton = ∅, continue
    Form mask Ihole_small
End for
Transfer all Ihole_small into full size mask Ihole
Get image complement from the second reference view
Combine image complement and Iv, get I_SC1plus
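The final combination step can be sketched as follows. This is an illustrative NumPy version under assumptions: the auxiliary view is already projected onto the virtual viewpoint, and "valid" points are mask points whose auxiliary depth indicates foreground (depth above Dthr).

```python
import numpy as np

def complete_with_aux(iv, iv_aux, depth_aux, hole_mask, d_thr):
    """Fill the silhouette mask with FG pixels from the auxiliary view.

    iv, iv_aux : (H, W, 3) color images in the virtual viewpoint
    depth_aux  : (H, W) depth of the projected auxiliary view
    hole_mask  : boolean mask produced by the detection stage
    """
    # Valid points: inside the mask AND foreground in the auxiliary view
    valid = hole_mask & (depth_aux > d_thr)
    out = iv.copy()
    out[valid] = iv_aux[valid]   # the "image complement"
    return out
```

Only the pixels selected by `valid` would ever need to be transmitted from the auxiliary camera, which is the point of the 1plus-view approach.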

Experimental Results

6 EXPERIMENTAL RESULTS

Detection of the disoccluded parts of the figure and formation of a suitable mask are the initial stages of either approach to the problem, and were theoretically developed and implemented in Matlab for the purposes of the thesis. The first approach, SC-inpainting, is structured similarly to the basic inpainting algorithm of Criminisi, with the following alterations: a) the inpainting is performed for one bone of the skeleton at a time (whenever needed), in a restricted window of the image; b) patches with structures parallel to the bone are assigned high priority in order to complete the figure by following the directions of the skeleton. The core of the algorithm is based on the Matlab implementation of Criminisi's algorithm accessed through GitHub. The theoretical analysis of this algorithm and the Matlab implementation of the alterations to the basic inpainting algorithm were developed by the author of this document. The same holds for the alternative SC-inpainting algorithm with the depth estimation stage, as well as for the SC-1plus-view approach.

6.1 EVALUATION METRICS

PSNR AS AN EVALUATION METRIC

In the literature of inpainting in Multiview, PSNR is widely used as an objective evaluation metric. However, there are doubts as to whether PSNR is the best objective measure for evaluating the result of an inpainting algorithm. The whole point of inpainting is to create something that looks natural to the human eye: we might not have the actual data, but we want something that LOOKS plausible. Especially in our case, the target is to complete human silhouettes. This means that we want to solve the problem of having parts of the body floating away from the rest and obtain a connected figure. However, we might connect a limb, e.g. an arm, to the rest of the body but not EXACTLY as it was. It might be a bit more to the left or to the right, or it might be a bit darker or lighter than the ground truth.
Moreover, our objective is to reconstruct the FG of the image as well as possible, while the mask that we have created is not restricted to the FG. This does not mean that the result looks unnatural. So checking pixel by pixel between the inpainted image and the ground truth might not be the best way to evaluate the method. Here are two examples that illustrate the paradox of the PSNR.

[Figure: two image sets comparing the proposed method's final result, Criminisi's result and the ground truth, each result annotated with its PSNR.]

Table 6.1 The paradox of PSNR

On the first set of images (ballerina) we do not see a great improvement in the first column compared to the second column; the ballerina's arm does not get properly propagated. However, the PSNR is ~1.5 dB higher. On the second set we see that the man's arm is well propagated and connected to the rest of the body in the image of the first column; his leg also has a better, or at least more convincing, color. And yet the PSNR turns out to be lower than that of the second image. In order to get a more complete evaluation, two more metrics are going to be used.

STRUCTURAL SIMILARITY INDEX MEASURE

A metric that better takes into account the nature of inpainting, and is used in some papers in the literature, like [5], is the Structural Similarity Index (SSIM). This metric was introduced in the work of [32]. SSIM is an image quality metric that assesses the visual impact of three characteristics of an image: luminance, contrast and structure. It is in fact the product of three terms:

SSIM(X, Y) = l(X, Y)^α · c(X, Y)^β · s(X, Y)^γ    (6.1)

where X and Y are the images being compared, l is the luminance term, c is the contrast term, s is the structure term, and α, β and γ take here their default value 1.

Luminance term: l(X, Y) = (2 μ_X μ_Y + C₁) / (μ_X² + μ_Y² + C₁)    (6.2)

Contrast term: c(X, Y) = (2 σ_X σ_Y + C₂) / (σ_X² + σ_Y² + C₂)    (6.3)

Structure term: s(X, Y) = (σ_XY + C₃) / (σ_X σ_Y + C₃)    (6.4)

C₁, C₂ and C₃ are constants that take their default values, with C₃ = C₂/2. μ_X, μ_Y, σ_X, σ_Y and σ_XY are the local means, standard deviations and cross-covariance of the images X and Y.
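Eqs. (6.1)-(6.4) can be sketched in NumPy. Note the assumptions: standard SSIM computes these statistics over a sliding local window and averages the resulting map, whereas this simplified sketch computes a single global value; the default constants are those commonly used for 8-bit images, C₁ = (0.01·255)² and C₂ = (0.03·255)².

```python
import numpy as np

def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """Global SSIM following Eqs. (6.1)-(6.4); c3 = c2 / 2."""
    x = x.astype(float)
    y = y.astype(float)
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    c3 = c2 / 2
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)  # luminance
    c = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)  # contrast
    s = (cov + c3) / (sd_x * sd_y + c3)                        # structure
    return l * c * s
```

Identical images score 1; structural disagreement drives the score down even when luminance and contrast match.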

C METRIC

Lastly, since we are interested in the artifacts of the foreground, we introduce another metric: the number of correctly (within a threshold) reconstructed pixels that belong to the foreground. More specifically, the locations of the FG pixels in the masked area are extracted from the ground-truth depth image. Then at each location (x, y) we check whether:

|Inpainted_Img(x, y) − Original_Img(x, y)| < epsilon    (6.5)

where epsilon is a threshold parameter. If so, we consider the pixel correctly reconstructed. In the end, we count the number of correctly reconstructed pixels as a percentage of all the foreground pixels that have been inpainted, and this becomes our new metric C:

q = (Depth_GroundTruth > Depth_threshold) AND (Image_mask == true)    (6.6)

So q contains the indices of the points that belong to the foreground and have been detected and inpainted by our method, and |q| is the number of non-zero elements of q. Then:

C = (1/|q|) · Σ_q [ |Inpainted_Img − Original_Img| < epsilon ]    (6.7)

For the threshold parameter epsilon we begin with epsilon = 15, on the grounds that we have an RGB image and we can tolerate ±5 in each color component for the result to look natural.

6.2 SILHOUETTE COMPLETION VIA INPAINTING

EXPERIMENT CAM4 TO CAM1 IN FRAME 0

We project the view of camera 4 onto the viewpoint of camera 1 (in frame 0). An averaging function is applied to fill in all the small breaking lines (both on FG and BG).

Table 6.2 Removing small breaking lines by averaging
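For reference, the C metric of Eqs. (6.5)-(6.7) from section 6.1 can be sketched as follows. This is an illustrative NumPy version, not the thesis Matlab code; the per-pixel error is taken as the absolute difference summed over the three color channels, which matches the reading "epsilon = 15, i.e. ±5 per color component".

```python
import numpy as np

def c_metric(inpainted, original, depth_gt, mask, d_thr, epsilon=15):
    """Fraction of inpainted FG pixels whose summed RGB error is < epsilon."""
    q = (depth_gt > d_thr) & mask          # FG pixels that were inpainted (Eq. 6.6)
    if not q.any():
        return 0.0
    err = np.abs(inpainted.astype(int) - original.astype(int)).sum(axis=2)
    return float((err[q] < epsilon).mean())   # Eq. (6.7)
```

The result is reported as a percentage in the tables below.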

Next we use SC-inpainting to connect the figure (FG) and Criminisi's method for all the rest of the missing region (BG).

[Figure: skeleton; exposed skeleton; mask.]

Table 6.3 Mask formation step by step

[Figure: after SC-inpainting (the patch size used was 11); after SC-inpainting for FG and general inpainting for BG.]

Table 6.4 SC-inpainting before and after general inpainting

[Figure: proposed inpainting; Criminisi inpainting; ground truth, with PSNR, SSIM and C metric per result; C metric = 81.28% for the proposed method vs. 7.86% for Criminisi.]

Table 6.5 Comparison, experiment cam4 to cam1 in frame 0

The figure manages to connect completely and the image foreground looks completely natural. The parts of the BG that get inpainted with Criminisi's method have been filled in quite strangely, since it is not a method specialized for DIBR. However, it does not really matter which method we apply there, since we focus on the FG and we measure all metrics only on the mask that we produce, Mask_silh, and not on the large missing regions in the BG, Mask_gen. Moreover, Mask_gen and Mask_silh have hardly any intersection.

[Figure: the mask produced for SC-inpainting, Mask_silh; the mask used for general inpainting, Mask_gen.]

Table 6.6 The masks used by SC-inpainting and by general inpainting

So when we use only the general method, nothing gets applied to the regions that are interesting for us. But wherever the two masks do intersect, in any of the images that we test, we present the final results and the metrics for the image completed on both FG and BG. We see in Table 6.5 that we have an increase of 5.4 dB in PSNR when using silhouette completion via inpainting. The C metric is much higher (by more than 70 percentage points), which is expected, because the C metric measures the percentage of correctly reconstructed points of the FG and, as we said, Criminisi's method does not get applied to the regions where we expect to see the FG. The SSIM, which is again evaluated only on the mask that we produce, is elevated as well (~10%). Next we take a look at how the alternative method (with the depth estimation stage) performs. We can see that it does not manage to outperform the original approach; the two results are actually very similar.


More information

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA N. Zarrinpanjeh a, F. Dadrassjavan b, H. Fattahi c * a Islamic Azad University of Qazvin - nzarrin@qiau.ac.ir

More information

Lecture L3 - Vectors, Matrices and Coordinate Transformations

Lecture L3 - Vectors, Matrices and Coordinate Transformations S. Widnall 16.07 Dynamics Fall 2009 Lecture notes based on J. Peraire Version 2.0 Lecture L3 - Vectors, Matrices and Coordinate Transformations By using vectors and defining appropriate operations between

More information

Hole-Filling Method Using Depth Based In-Painting For View Synthesis in Free Viewpoint Television (FTV) and 3D Video

Hole-Filling Method Using Depth Based In-Painting For View Synthesis in Free Viewpoint Television (FTV) and 3D Video MISUBISHI ELECRIC RESEARCH LABORAORIES http://www.merl.com Hole-Filling Method Using Depth Based In-Painting For View Synthesis in Free Viewpoint elevision (FV) and 3D Video Kwan-Jung Oh, Sehoon Yea, Yo-Sung

More information

Vision based Vehicle Tracking using a high angle camera

Vision based Vehicle Tracking using a high angle camera Vision based Vehicle Tracking using a high angle camera Raúl Ignacio Ramos García Dule Shu gramos@clemson.edu dshu@clemson.edu Abstract A vehicle tracking and grouping algorithm is presented in this work

More information

Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007

Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007 Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Questions 2007 INSTRUCTIONS: Answer all questions. Spend approximately 1 minute per mark. Question 1 30 Marks Total

More information

Color Segmentation Based Depth Image Filtering

Color Segmentation Based Depth Image Filtering Color Segmentation Based Depth Image Filtering Michael Schmeing and Xiaoyi Jiang Department of Computer Science, University of Münster Einsteinstraße 62, 48149 Münster, Germany, {m.schmeing xjiang}@uni-muenster.de

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM

SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM SYNTHESIZING FREE-VIEWPOINT IMAGES FROM MULTIPLE VIEW VIDEOS IN SOCCER STADIUM Kunihiko Hayashi, Hideo Saito Department of Information and Computer Science, Keio University {hayashi,saito}@ozawa.ics.keio.ac.jp

More information

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg Image Processing and Computer Graphics Rendering Pipeline Matthias Teschner Computer Science Department University of Freiburg Outline introduction rendering pipeline vertex processing primitive processing

More information

A Prototype For Eye-Gaze Corrected

A Prototype For Eye-Gaze Corrected A Prototype For Eye-Gaze Corrected Video Chat on Graphics Hardware Maarten Dumont, Steven Maesen, Sammy Rogmans and Philippe Bekaert Introduction Traditional webcam video chat: No eye contact. No extensive

More information

Physics 235 Chapter 1. Chapter 1 Matrices, Vectors, and Vector Calculus

Physics 235 Chapter 1. Chapter 1 Matrices, Vectors, and Vector Calculus Chapter 1 Matrices, Vectors, and Vector Calculus In this chapter, we will focus on the mathematical tools required for the course. The main concepts that will be covered are: Coordinate transformations

More information

Reference view selection in DIBR-based multiview coding

Reference view selection in DIBR-based multiview coding Reference view selection in DIBR-based multiview coding Thomas Maugey, Member IEEE, Giovanni Petrazzuoli, Pascal Frossard, Senior Member IEEE and Marco Cagnazzo, Senior Member IEEE and Béatrice Pesquet-Popescu,

More information

6 Space Perception and Binocular Vision

6 Space Perception and Binocular Vision Space Perception and Binocular Vision Space Perception and Binocular Vision space perception monocular cues to 3D space binocular vision and stereopsis combining depth cues monocular/pictorial cues cues

More information

A NEW SUPER RESOLUTION TECHNIQUE FOR RANGE DATA. Valeria Garro, Pietro Zanuttigh, Guido M. Cortelazzo. University of Padova, Italy

A NEW SUPER RESOLUTION TECHNIQUE FOR RANGE DATA. Valeria Garro, Pietro Zanuttigh, Guido M. Cortelazzo. University of Padova, Italy A NEW SUPER RESOLUTION TECHNIQUE FOR RANGE DATA Valeria Garro, Pietro Zanuttigh, Guido M. Cortelazzo University of Padova, Italy ABSTRACT Current Time-of-Flight matrix sensors allow for the acquisition

More information

Intuitive Navigation in an Enormous Virtual Environment

Intuitive Navigation in an Enormous Virtual Environment / International Conference on Artificial Reality and Tele-Existence 98 Intuitive Navigation in an Enormous Virtual Environment Yoshifumi Kitamura Shinji Fukatsu Toshihiro Masaki Fumio Kishino Graduate

More information

RESEARCH PAPERS FACULTY OF MATERIALS SCIENCE AND TECHNOLOGY IN TRNAVA SLOVAK UNIVERSITY OF TECHNOLOGY IN BRATISLAVA

RESEARCH PAPERS FACULTY OF MATERIALS SCIENCE AND TECHNOLOGY IN TRNAVA SLOVAK UNIVERSITY OF TECHNOLOGY IN BRATISLAVA RESEARCH PAPERS FACULTY OF MATERIALS SCIENCE AND TECHNOLOGY IN TRNAVA SLOVAK UNIVERSITY OF TECHNOLOGY IN BRATISLAVA 2010 Number 29 3D MODEL GENERATION FROM THE ENGINEERING DRAWING Jozef VASKÝ, Michal ELIÁŠ,

More information

Blender 3D Animation

Blender 3D Animation Bachelor Maths/Physics/Computer Science University Paris-Sud Digital Imaging Course Blender 3D Animation Christian Jacquemin Introduction to Computer Animation Animation Basics animation consists in changing

More information

Scanners and How to Use Them

Scanners and How to Use Them Written by Jonathan Sachs Copyright 1996-1999 Digital Light & Color Introduction A scanner is a device that converts images to a digital file you can use with your computer. There are many different types

More information

Accurate and robust image superresolution by neural processing of local image representations

Accurate and robust image superresolution by neural processing of local image representations Accurate and robust image superresolution by neural processing of local image representations Carlos Miravet 1,2 and Francisco B. Rodríguez 1 1 Grupo de Neurocomputación Biológica (GNB), Escuela Politécnica

More information

Interactive Computer Graphics

Interactive Computer Graphics Interactive Computer Graphics Lecture 18 Kinematics and Animation Interactive Graphics Lecture 18: Slide 1 Animation of 3D models In the early days physical models were altered frame by frame to create

More information

How To Fuse A Point Cloud With A Laser And Image Data From A Pointcloud

How To Fuse A Point Cloud With A Laser And Image Data From A Pointcloud REAL TIME 3D FUSION OF IMAGERY AND MOBILE LIDAR Paul Mrstik, Vice President Technology Kresimir Kusevic, R&D Engineer Terrapoint Inc. 140-1 Antares Dr. Ottawa, Ontario K2E 8C4 Canada paul.mrstik@terrapoint.com

More information

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY V. Knyaz a, *, Yu. Visilter, S. Zheltov a State Research Institute for Aviation System (GosNIIAS), 7, Victorenko str., Moscow, Russia

More information

Binocular Vision and The Perception of Depth

Binocular Vision and The Perception of Depth Binocular Vision and The Perception of Depth Visual Perception How one visually interprets a scene 4 forms of perception to be studied: Depth Color Temporal Motion Depth Perception How does one determine

More information

EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM

EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM Amol Ambardekar, Mircea Nicolescu, and George Bebis Department of Computer Science and Engineering University

More information

Numerical Methods For Image Restoration

Numerical Methods For Image Restoration Numerical Methods For Image Restoration CIRAM Alessandro Lanza University of Bologna, Italy Faculty of Engineering CIRAM Outline 1. Image Restoration as an inverse problem 2. Image degradation models:

More information

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm

Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm 1 Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm Hani Mehrpouyan, Student Member, IEEE, Department of Electrical and Computer Engineering Queen s University, Kingston, Ontario,

More information

Introduction to Computer Graphics

Introduction to Computer Graphics Introduction to Computer Graphics Torsten Möller TASC 8021 778-782-2215 torsten@sfu.ca www.cs.sfu.ca/~torsten Today What is computer graphics? Contents of this course Syllabus Overview of course topics

More information

Metrics on SO(3) and Inverse Kinematics

Metrics on SO(3) and Inverse Kinematics Mathematical Foundations of Computer Graphics and Vision Metrics on SO(3) and Inverse Kinematics Luca Ballan Institute of Visual Computing Optimization on Manifolds Descent approach d is a ascent direction

More information

Introduction. C 2009 John Wiley & Sons, Ltd

Introduction. C 2009 John Wiley & Sons, Ltd 1 Introduction The purpose of this text on stereo-based imaging is twofold: it is to give students of computer vision a thorough grounding in the image analysis and projective geometry techniques relevant

More information

Colour Image Segmentation Technique for Screen Printing

Colour Image Segmentation Technique for Screen Printing 60 R.U. Hewage and D.U.J. Sonnadara Department of Physics, University of Colombo, Sri Lanka ABSTRACT Screen-printing is an industry with a large number of applications ranging from printing mobile phone

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Automatic Labeling of Lane Markings for Autonomous Vehicles

Automatic Labeling of Lane Markings for Autonomous Vehicles Automatic Labeling of Lane Markings for Autonomous Vehicles Jeffrey Kiske Stanford University 450 Serra Mall, Stanford, CA 94305 jkiske@stanford.edu 1. Introduction As autonomous vehicles become more popular,

More information

BCC Multi Stripe Wipe

BCC Multi Stripe Wipe BCC Multi Stripe Wipe The BCC Multi Stripe Wipe is a similar to a Horizontal or Vertical Blind wipe. It offers extensive controls to randomize the stripes parameters. The following example shows a Multi

More information

THREE DIMENSIONAL GEOMETRY

THREE DIMENSIONAL GEOMETRY Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

More information

9.4. The Scalar Product. Introduction. Prerequisites. Learning Style. Learning Outcomes

9.4. The Scalar Product. Introduction. Prerequisites. Learning Style. Learning Outcomes The Scalar Product 9.4 Introduction There are two kinds of multiplication involving vectors. The first is known as the scalar product or dot product. This is so-called because when the scalar product of

More information

Segmentation of building models from dense 3D point-clouds

Segmentation of building models from dense 3D point-clouds Segmentation of building models from dense 3D point-clouds Joachim Bauer, Konrad Karner, Konrad Schindler, Andreas Klaus, Christopher Zach VRVis Research Center for Virtual Reality and Visualization, Institute

More information

Video Codec Requirements and Evaluation Methodology

Video Codec Requirements and Evaluation Methodology -47pt -30pt :white Font : edium t Video Codec Requirements and Evaluation Methodology www.huawei.com draft-filippov-netvc-requirements-02 Alexey Filippov, Jose Alvarez (Huawei Technologies) Contents An

More information

Structural Axial, Shear and Bending Moments

Structural Axial, Shear and Bending Moments Structural Axial, Shear and Bending Moments Positive Internal Forces Acting Recall from mechanics of materials that the internal forces P (generic axial), V (shear) and M (moment) represent resultants

More information

Protocol for Microscope Calibration

Protocol for Microscope Calibration Protocol for Microscope Calibration A properly calibrated system is essential for successful and efficient software use. The following are step by step instructions on how to calibrate the hardware using

More information

Object tracking & Motion detection in video sequences

Object tracking & Motion detection in video sequences Introduction Object tracking & Motion detection in video sequences Recomended link: http://cmp.felk.cvut.cz/~hlavac/teachpresen/17compvision3d/41imagemotion.pdf 1 2 DYNAMIC SCENE ANALYSIS The input to

More information

SSIM Technique for Comparison of Images

SSIM Technique for Comparison of Images SSIM Technique for Comparison of Images Anil Wadhokar 1, Krupanshu Sakharikar 2, Sunil Wadhokar 3, Geeta Salunke 4 P.G. Student, Department of E&TC, GSMCOE Engineering College, Pune, Maharashtra, India

More information

B2.53-R3: COMPUTER GRAPHICS. NOTE: 1. There are TWO PARTS in this Module/Paper. PART ONE contains FOUR questions and PART TWO contains FIVE questions.

B2.53-R3: COMPUTER GRAPHICS. NOTE: 1. There are TWO PARTS in this Module/Paper. PART ONE contains FOUR questions and PART TWO contains FIVE questions. B2.53-R3: COMPUTER GRAPHICS NOTE: 1. There are TWO PARTS in this Module/Paper. PART ONE contains FOUR questions and PART TWO contains FIVE questions. 2. PART ONE is to be answered in the TEAR-OFF ANSWER

More information

Wii Remote Calibration Using the Sensor Bar

Wii Remote Calibration Using the Sensor Bar Wii Remote Calibration Using the Sensor Bar Alparslan Yildiz Abdullah Akay Yusuf Sinan Akgul GIT Vision Lab - http://vision.gyte.edu.tr Gebze Institute of Technology Kocaeli, Turkey {yildiz, akay, akgul}@bilmuh.gyte.edu.tr

More information

ENGN 2502 3D Photography / Winter 2012 / SYLLABUS http://mesh.brown.edu/3dp/

ENGN 2502 3D Photography / Winter 2012 / SYLLABUS http://mesh.brown.edu/3dp/ ENGN 2502 3D Photography / Winter 2012 / SYLLABUS http://mesh.brown.edu/3dp/ Description of the proposed course Over the last decade digital photography has entered the mainstream with inexpensive, miniaturized

More information

A Short Introduction to Computer Graphics

A Short Introduction to Computer Graphics A Short Introduction to Computer Graphics Frédo Durand MIT Laboratory for Computer Science 1 Introduction Chapter I: Basics Although computer graphics is a vast field that encompasses almost any graphical

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

Highlight Removal by Illumination-Constrained Inpainting

Highlight Removal by Illumination-Constrained Inpainting Highlight Removal by Illumination-Constrained Inpainting Ping Tan Stephen Lin Long Quan Heung-Yeung Shum Microsoft Research, Asia Hong Kong University of Science and Technology Abstract We present a single-image

More information

jorge s. marques image processing

jorge s. marques image processing image processing images images: what are they? what is shown in this image? What is this? what is an image images describe the evolution of physical variables (intensity, color, reflectance, condutivity)

More information

Elasticity Theory Basics

Elasticity Theory Basics G22.3033-002: Topics in Computer Graphics: Lecture #7 Geometric Modeling New York University Elasticity Theory Basics Lecture #7: 20 October 2003 Lecturer: Denis Zorin Scribe: Adrian Secord, Yotam Gingold

More information

Face detection is a process of localizing and extracting the face region from the

Face detection is a process of localizing and extracting the face region from the Chapter 4 FACE NORMALIZATION 4.1 INTRODUCTION Face detection is a process of localizing and extracting the face region from the background. The detected face varies in rotation, brightness, size, etc.

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

High Quality Image Magnification using Cross-Scale Self-Similarity

High Quality Image Magnification using Cross-Scale Self-Similarity High Quality Image Magnification using Cross-Scale Self-Similarity André Gooßen 1, Arne Ehlers 1, Thomas Pralow 2, Rolf-Rainer Grigat 1 1 Vision Systems, Hamburg University of Technology, D-21079 Hamburg

More information

Machine vision systems - 2

Machine vision systems - 2 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

Part II Redundant Dictionaries and Pursuit Algorithms

Part II Redundant Dictionaries and Pursuit Algorithms Aisenstadt Chair Course CRM September 2009 Part II Redundant Dictionaries and Pursuit Algorithms Stéphane Mallat Centre de Mathématiques Appliquées Ecole Polytechnique Sparsity in Redundant Dictionaries

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Understanding astigmatism Spring 2003

Understanding astigmatism Spring 2003 MAS450/854 Understanding astigmatism Spring 2003 March 9th 2003 Introduction Spherical lens with no astigmatism Crossed cylindrical lenses with astigmatism Horizontal focus Vertical focus Plane of sharpest

More information

Kapitel 12. 3D Television Based on a Stereoscopic View Synthesis Approach

Kapitel 12. 3D Television Based on a Stereoscopic View Synthesis Approach Kapitel 12 3D Television Based on a Stereoscopic View Synthesis Approach DIBR (Depth-Image-Based Rendering) approach 3D content generation DIBR from non-video-rate depth stream Autostereoscopic displays

More information

Self-Calibrated Structured Light 3D Scanner Using Color Edge Pattern

Self-Calibrated Structured Light 3D Scanner Using Color Edge Pattern Self-Calibrated Structured Light 3D Scanner Using Color Edge Pattern Samuel Kosolapov Department of Electrical Engineering Braude Academic College of Engineering Karmiel 21982, Israel e-mail: ksamuel@braude.ac.il

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Robust and Efficient Implicit Surface Reconstruction for Point Clouds Based on Convexified Image Segmentation

Robust and Efficient Implicit Surface Reconstruction for Point Clouds Based on Convexified Image Segmentation Noname manuscript No. (will be inserted by the editor) Robust and Efficient Implicit Surface Reconstruction for Point Clouds Based on Convexified Image Segmentation Jian Liang Frederick Park Hongkai Zhao

More information

Introduction to Computer Graphics. Reading: Angel ch.1 or Hill Ch1.

Introduction to Computer Graphics. Reading: Angel ch.1 or Hill Ch1. Introduction to Computer Graphics Reading: Angel ch.1 or Hill Ch1. What is Computer Graphics? Synthesis of images User Computer Image Applications 2D Display Text User Interfaces (GUI) - web - draw/paint

More information

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow , pp.233-237 http://dx.doi.org/10.14257/astl.2014.51.53 A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow Giwoo Kim 1, Hye-Youn Lim 1 and Dae-Seong Kang 1, 1 Department of electronices

More information

Canny Edge Detection

Canny Edge Detection Canny Edge Detection 09gr820 March 23, 2009 1 Introduction The purpose of edge detection in general is to significantly reduce the amount of data in an image, while preserving the structural properties

More information

REAL-TIME IMAGE BASED LIGHTING FOR OUTDOOR AUGMENTED REALITY UNDER DYNAMICALLY CHANGING ILLUMINATION CONDITIONS

REAL-TIME IMAGE BASED LIGHTING FOR OUTDOOR AUGMENTED REALITY UNDER DYNAMICALLY CHANGING ILLUMINATION CONDITIONS REAL-TIME IMAGE BASED LIGHTING FOR OUTDOOR AUGMENTED REALITY UNDER DYNAMICALLY CHANGING ILLUMINATION CONDITIONS Tommy Jensen, Mikkel S. Andersen, Claus B. Madsen Laboratory for Computer Vision and Media

More information

Service-Oriented Visualization of Virtual 3D City Models

Service-Oriented Visualization of Virtual 3D City Models Service-Oriented Visualization of Virtual 3D City Models Authors: Jan Klimke, Jürgen Döllner Computer Graphics Systems Division Hasso-Plattner-Institut, University of Potsdam, Germany http://www.hpi3d.de

More information

Effective Gradient Domain Object Editing on Mobile Devices

Effective Gradient Domain Object Editing on Mobile Devices Effective Gradient Domain Object Editing on Mobile Devices Yingen Xiong, Dingding Liu, Kari Pulli Nokia Research Center, Palo Alto, CA, USA Email: {yingen.xiong, dingding.liu, kari.pulli}@nokia.com University

More information

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system 1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables

More information

A System for Capturing High Resolution Images

A System for Capturing High Resolution Images A System for Capturing High Resolution Images G.Voyatzis, G.Angelopoulos, A.Bors and I.Pitas Department of Informatics University of Thessaloniki BOX 451, 54006 Thessaloniki GREECE e-mail: pitas@zeus.csd.auth.gr

More information

Example Chapter 08-Number 09: This example demonstrates some simple uses of common canned effects found in popular photo editors to stylize photos.

Example Chapter 08-Number 09: This example demonstrates some simple uses of common canned effects found in popular photo editors to stylize photos. 08 SPSE ch08 2/22/10 11:34 AM Page 156 156 Secrets of ProShow Experts: The Official Guide to Creating Your Best Slide Shows with ProShow Gold and Producer Figure 8.18 Using the same image washed out and

More information

Unified Lecture # 4 Vectors

Unified Lecture # 4 Vectors Fall 2005 Unified Lecture # 4 Vectors These notes were written by J. Peraire as a review of vectors for Dynamics 16.07. They have been adapted for Unified Engineering by R. Radovitzky. References [1] Feynmann,

More information

Path Tracking for a Miniature Robot

Path Tracking for a Miniature Robot Path Tracking for a Miniature Robot By Martin Lundgren Excerpt from Master s thesis 003 Supervisor: Thomas Hellström Department of Computing Science Umeå University Sweden 1 Path Tracking Path tracking

More information

Maya 2014 Basic Animation & The Graph Editor

Maya 2014 Basic Animation & The Graph Editor Maya 2014 Basic Animation & The Graph Editor When you set a Keyframe (or Key), you assign a value to an object s attribute (for example, translate, rotate, scale, color) at a specific time. Most animation

More information

Optical Illusions Essay Angela Wall EMAT 6690

Optical Illusions Essay Angela Wall EMAT 6690 Optical Illusions Essay Angela Wall EMAT 6690! Optical illusions are images that are visually perceived differently than how they actually appear in reality. These images can be very entertaining, but

More information

Segmentation & Clustering

Segmentation & Clustering EECS 442 Computer vision Segmentation & Clustering Segmentation in human vision K-mean clustering Mean-shift Graph-cut Reading: Chapters 14 [FP] Some slides of this lectures are courtesy of prof F. Li,

More information