Stereo+Kinect for High Resolution Stereo Correspondences




Gowri Somanath, University of Delaware, gowri@udel.edu
Scott Cohen and Brian Price, Adobe Research, {scohen,bprice}@adobe.com
Chandra Kambhamettu, University of Delaware, chandrak@udel.edu

Abstract

In this work, we combine two complementary depth sensors, the Kinect and stereo image matching, to obtain high quality correspondences. Our goal is to obtain a dense disparity map at the spatial and depth resolution of the stereo cameras (4-12 MP). We propose a global optimization scheme in which both the data and smoothness costs are derived from sensor confidences and the low resolution geometry from the Kinect. A spatially varying search range is used to limit the number of potential disparities at each pixel. The smoothness prior is based on the available low resolution depth from the Kinect rather than on image gradients, and thus performs better both in textured areas with smooth depth and in textureless areas with a depth gradient. We also propose a spatially varying smoothness weight to better handle occlusion areas and the relative contribution of the two energy terms. We demonstrate how the two sensors can be effectively fused to obtain correct scene depth in ambiguous areas, as well as fine structural details in textured areas.

1. Introduction

Recent years have seen increasing popularity of 3D content and sensors, and many commercial and consumer level capture and display devices are stereoscopic. Many applications are in entertainment, where existing high resolution still or video sensors have been extended for stereo capture. Example applications include consistent segmentation, object extraction, depth manipulation and view synthesis. The focus of our work is to obtain high quality stereo correspondences for such multi-megapixel stereo images and videos for the above applications. The two broad challenges we face are the nature of the scene and the scale of the problem. The bane of stereo is matching in ambiguous regions with low or repeated texture, which abound in natural scenes. The second challenge is the computational complexity due to the large image size (4-12 megapixels (MP)) and disparity range (200-500 integer disparity levels). (Gowri Somanath was supported by Adobe Research for this work.)

Figure 1. Top: the setup used in our experiments. Bottom: DSLR image of a sample scene and the depth map from the Kinect, where black indicates areas with no depth reading from the Kinect.

In this paper we address these challenges through fusion of stereo image matching with a complementary depth sensor, the Kinect. Though the Kinect does not suffer from ambiguity in regions of low or repeated texture, its spatial and depth resolution is at least an order of magnitude lower than that of the cameras commonly used for the applications above. On the other hand, a calibrated stereo setup using high resolution cameras can provide higher depth resolution. Figure 1 illustrates the complementary nature of the two sensors. The repeated chessboard pattern and the single colored boards are a challenge for stereo matching. Areas with low reflectance, such as the black wall, cannot be resolved by the Kinect. Both the Kinect and stereo matching can obtain depth estimates in regions with non-ambiguous texture, such as those on the toys and cloths. However, the low depth resolution of the Kinect does not recover fine structural details.
We thus propose a stereo algorithm that combines the information from the stereo RGB images with the low resolution depth information from the Kinect to obtain a dense disparity map at the resolution of the stereo images. Global optimization based algorithms have often been found most suitable for obtaining the smooth and dense depth maps essential for the target applications. However, they suffer from large memory and computation time requirements, and fail to scale well to the large disparity volumes arising from 4-12 MP images and 200-500 disparity levels.

In our experiments on such large problems, alpha expansion [1] on a traditional stereo formulation took several hours to converge, and previous works have only reported results on images of less than 2 MP. We obtain depth maps of high spatial and depth resolution through a global optimization framework. The confidence of the sensors and the low resolution geometry information from the Kinect are used to derive both the data and smoothness costs in our energy function. The main contribution of our work is a framework for obtaining high resolution correspondences through fusion of stereo matching and the Kinect. Using the depth estimates from the Kinect and the confidences of both sensors, the data cost is calculated at only a sparse set of labels. The matching cost also combines image consistency with the geometry prior available from the Kinect. The smoothness prior is based on combined depth and image gradient information. This offers a significant advantage over the traditional use of the image gradient in both textured and non-textured areas: for example, there can be geometrically smooth regions with strong texture gradients, or a single colored object with a smooth depth gradient. We demonstrate how our proposed changes to the data and smoothness terms can effectively combine the advantages of the two systems and recover depth better than either sensor in isolation. In addition, the sparse label set reduces the time and memory complexity of alpha expansion, which therefore converges orders of magnitude faster. As our primary goal is to obtain correspondences at the resolution of the stereo images, our scheme is in line with the growing trend towards scalable algorithms for large disparity volumes. The image resolution provides spatial resolution, while the large number of disparity levels aids better reconstruction of geometry. While most works are restricted to 1-2 MP, we make a large jump towards 4-12 MP images. The large spatial and depth resolution gap introduces challenges in the registration of the two systems, leading to errors in re-projection/alignment and to sparsity (roughly 10% of the stereo image pixels have depth information from the Kinect). The proposed scheme has been designed to handle these effects, and can easily be adapted to any other source of depth.

2. Related Work

The stereo literature is vast; here we discuss related works, under three categories, that have employed alternate depth sensors in combination with single or stereo cameras. The first set of schemes obtains multiple samples from a single moving depth sensor to improve accuracy or density [8, 2, 10]. Structure from Motion and tracking are used to register the multiple scans. Though these schemes can provide a high quality depth map as a final result, there is no clear relation between the final accuracy and the number of samples required. There is also an inherent assumption that the sensor provides a depth at every point in the scene, which may not be true for certain surface colors and materials. The second category combines a depth sensor with a color image from a higher resolution camera for super-resolution [3, 14, 9]. Color information is used to obtain the final depth map by up-sampling, assuming image edges to be potential depth edges. Though these methods obtain spatial super-resolution, there is no increase in depth resolution or accuracy. The third group of work combines a depth sensor with a stereo camera system [5, 15, 13, 12] and is the closest to our line of work.
Similar to the papers discussed above, most of these works assume that fairly accurate depth information is available for each surface in the scene. In [5], a 160×120 pixel Photonic Mixer Device (PMD) is used with a 640×480 pixel stereo camera. Intensity from the PMD is used to determine a binary confidence mask: the value from the PMD is used at pixels of high intensity, while the remaining pixels are matched through stereo. Thus there is inherently no increase in depth resolution, even in textured areas. In [15], a 144×176 pixel Time-of-Flight (ToF) sensor is combined with a 320×240 pixel stereo setup, and Belief Propagation is used for fusion. The data term is formed from a weighted linear combination of stereo matching costs and depth from the ToF sensor. Unlike in our scheme, matching costs are calculated at all disparities at each pixel, and the weights are not based on sensor confidences. In [13], a 160×120 ToF sensor is combined with a 1024×760 stereo camera. The ToF depth is up-sampled using joint bilateral filtering, and a confidence map is obtained from the ToF signal strength at each pixel. A stereo confidence map is computed from local image features. A final weight map, the product of the two, is used to combine the respective cost volumes, and the final depth map is determined by a greedy approach. Though the method employs sensor confidence for fusion, the lack of a global optimization can lead to noisy results in regions where both sensors have low confidence. Recently, [12] proposed a global stereo scheme using sparse ground control points (GCPs), which are high confidence depth values obtained from sparse image matching or depth sensors. Like some of the earlier techniques, the depth value at a GCP is assumed to be correct and there is no provision for using sensor confidence. The first two terms of their energy are standard data and smoothness terms from color consistency; the third is the GCP energy. The sparse depth at the GCPs is interpolated using an adaptive propagation algorithm, and the GCP energy penalizes disparity assignments that diverge from the interpolated value at each pixel. Since a scalar parameter is used to control the deviation, it can lead to wrong estimates on surfaces where GCPs are absent or very sparse, resulting in invalid interpolation. In addition, the high computational cost required the authors to resize the images to less than 2 MP. In the next section we provide an overview of how we overcome the above limitations, followed by details of the individual components and results.

3. Proposed Method

The goal of our work is to fuse the information from the Kinect and traditional stereo matching so as to leverage the advantages of both: obtaining depth through the Kinect in ambiguous regions, and increasing depth resolution and precision through stereo matching of the high-resolution images. We combine the information using an energy minimization approach. Given the set of image pixels P and labels L = \{L_1, L_2, \ldots, L_{max}\} corresponding to disparities, an image labeling f assigns a disparity f_p \in L to each pixel p \in P. The final labeling is obtained by minimizing the following energy using alpha expansion and graph cuts [1]:

E(f) = \sum_{p \in P} D(p, f_p) + \sum_{(p,q) \in N} V(p, q, f_p, f_q).   (1)

The data cost D(p, f_p) is the cost of assigning label f_p to pixel p, and the smoothness cost V(p, q, f_p, f_q) is the cost of labeling neighboring pixels p and q as f_p and f_q. Stereo matching algorithms have used this framework before: the data cost D(p, f_p) is traditionally computed using the color difference between the left and right images, and the smoothness term V(p, q, f_p, f_q) is computed using the color similarity of neighboring pixels. We enhance both terms in this model by including information from the Kinect.

The data cost D(p, f_p) is improved in two ways. First, the number of possible labels is greatly reduced, to values around the disparities indicated by the Kinect. Second, the labels can be biased directly to be similar to the Kinect output. This helps in areas such as the flat planes in Figure 1, where the stereo color information is completely ambiguous while the Kinect gives fairly good depth within its precision limits. It also significantly accelerates the algorithm, since fewer labels are considered at each pixel. Details are given in Section 3.2.

The smoothness term V(p, q, f_p, f_q) combines Kinect and image information in two ways. Our first modification is the use of an image combining color and depth information to derive the neighborhood costs, which has two effects. First, it allows suppression of non-depth edges in the image: we use depth discontinuities suggested confidently by the Kinect, instead of differences in color, to decide where not to require smoothness in the disparity labeling. Many images have color edges that do not correspond to depth discontinuities, such as the checkerboard pattern in Figure 1, and these edges can confuse the stereo algorithm. The Kinect depth map, correctly, does not register these edges as depth discontinuities. Unfortunately, the depth discontinuities produced by the Kinect are not spatially precise and are at a lower resolution, so we align the Kinect depth edges to the high-resolution image color edges. The second advantage of the combined image is the introduction of depth gradients not visible in flat colored regions: the traditional formulation may miss depth gradients in textureless regions where no color difference is observed, so we use the confident Kinect disparities to guide the label gradients. Our second modification to the smoothness term is a spatially varying relative weight between the data and smoothness costs, to better handle occlusion and alignment error, as detailed in Section 3.3.
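To make the two terms concrete, here is a minimal sketch (in Python/NumPy; the function and variable names are ours, not from the paper) of how the energy in Eq. (1) is evaluated for a candidate labeling, given a precomputed data-cost volume and a pairwise smoothness function. The paper minimizes this energy with alpha expansion [1] rather than evaluating it exhaustively.

import numpy as np

def labeling_energy(data_cost, labeling, smoothness):
    """Evaluate E(f) = sum_p D(p, f_p) + sum_{(p,q) in N} V(p, q, f_p, f_q)
    for a 4-connected pixel grid.

    data_cost  : (H, W, L) array holding D(p, l) for every pixel and label
    labeling   : (H, W) integer array with the candidate disparity label f_p
    smoothness : callable V(p, q, f_p, f_q) returning the pairwise cost
    """
    H, W, _ = data_cost.shape
    ys, xs = np.indices((H, W))
    energy = data_cost[ys, xs, labeling].sum()         # unary (data) term
    for dy, dx in ((0, 1), (1, 0)):                    # right and down neighbors
        for y in range(H - dy):
            for x in range(W - dx):
                p, q = (y, x), (y + dy, x + dx)
                energy += smoothness(p, q, labeling[p], labeling[q])
    return energy

In practice the pairwise loop would be vectorized or handed directly to a graph-cut solver; the sketch only shows the structure of the objective.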
3.1. Setup and Initialization

Before describing the details of our framework, we discuss the setup and calibration process. We use two Canon EOS 7D cameras, with the Kinect mounted as shown in Figure 1. The Kinect depth map and the stereo images are 640×480 and 2592×1728 pixels, respectively. Figure 1 shows a sample scene and the Kinect depth values. The scene shows various cases where the sensors complement each other. The Kinect does not provide depth readings on the black walls and some parts of the table, as indicated by the black regions. The single colored boards and the chessboard pattern are challenging for stereo matching. Objects such as the toys and textured cloths are regions where both stereo and the Kinect perform fairly well. As our results will demonstrate, our scheme obtains more detail in this last type of region thanks to the high resolution images, and we are able to guide the disparity optimization in texture-less areas using the Kinect information and recover the depth gradients on the boards.

The stereo pair is calibrated using [11] and rectified using [4]. To calibrate the Kinect and stereo coordinate systems, we capture multiple calibration board images from the IR sensor and the stereo cameras. Given the individual intrinsic and extrinsic parameters, we reconstruct 3D points in the Kinect coordinate system and the corresponding corners in the stereo images. To transfer the valid Kinect depths to the stereo images, we first estimate the rotation, translation and scale that align the two 3D point clouds. We found that large image alignment errors (re-projection error onto the stereo images) can arise if only the 3D point alignment error is optimized. Since Kinect accuracy is a function of surface distance, orientation and reflectance, so is the error. Techniques have been proposed to reduce errors due to distance, but it is not always possible to correct those due to orientation or reflectance. In order to keep the algorithm general and avoid scene specific calibrations, we adapt our stereo algorithm to handle the alignment errors. However, we also found that such errors can be minimized by using a dense set of poses and orientations of the calibration board, and by estimating the transformation by optimizing the re-projection error onto the stereo images. Once the Kinect 3D points are transformed to the stereo coordinate system, we project them onto both the left and right views to obtain the disparity map.
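As an illustration of this transfer step, the following sketch (Python/NumPy; the variable names and the simple pinhole model for the rectified pair are our assumptions, not the authors' implementation) applies an estimated similarity transform to the Kinect point cloud and writes the corresponding disparities into a sparse map for the reference view.

import numpy as np

def transfer_kinect_disparities(points_k, s, R, t, K_rect, baseline, H, W):
    """Map Kinect 3D points into the rectified reference view and return a
    sparse disparity map (NaN where no Kinect sample lands).

    points_k : (N, 3) points in the Kinect coordinate system
    s, R, t  : scale, 3x3 rotation and 3-vector translation aligning the
               Kinect points with the rectified reference-camera frame
    K_rect   : 3x3 intrinsic matrix of the rectified pair
    baseline : stereo baseline (same units as the points)
    """
    P = s * points_k @ R.T + t                      # similarity transform
    Z = P[:, 2]
    keep = Z > 0
    P, Z = P[keep], Z[keep]
    u = K_rect[0, 0] * P[:, 0] / Z + K_rect[0, 2]   # pinhole projection
    v = K_rect[1, 1] * P[:, 1] / Z + K_rect[1, 2]
    d = K_rect[0, 0] * baseline / Z                 # disparity for a rectified pair
    sparse = np.full((H, W), np.nan)                # NaN = no Kinect sample
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    inside = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    sparse[vi[inside], ui[inside]] = d[inside]
    return sparse

Where several points land on the same pixel, for instance near depth boundaries, one of the values wins arbitrarily; this is exactly the mixed foreground/background issue discussed below, which the search range construction is designed to tolerate.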

Figure 2 shows the transferred depth for the sample scene. Figure 2(a) shows a binary map, where the dark pixels indicate those where a Kinect depth value was transferred. Figure 2(c-d) shows a 300×320 pixel region and illustrates the sparsity of the transferred values. Note that the density of the transferred Kinect depth is non-uniform and varies with the distance, orientation and reflectance of the surface. Depending on the viewpoint and occlusions, certain depth boundaries also contain multiple values in a neighborhood. In the following sections we detail how we handle some of these effects, and their impact on the results.

Figure 2. (a) Binary map indicating the pixels in the stereo image where we have depth values from the Kinect. (b) Nearest-neighbor up-sampling of the transferred depth values from the Kinect. (c) and (d) show a small part of the image and the corresponding depth values (in color) from the Kinect prior to up-sampling; white areas indicate missing values. (Best viewed on a computer screen.)

3.2. Data cost

Traditionally, the data cost is calculated for each disparity label using the differences in color between the rectified left and right stereo images I_l and I_r and/or the differences in image gradients G_l and G_r. For a pixel p, the traditional matching cost for assigning label f_p is calculated as

D_{tr}(p, f_p) = \alpha C_I(p, f_p) + \beta C_G(p, f_p), \quad f_p \in L,   (2)

where \alpha and \beta = 1 - \alpha are scalar weights, and C_I, C_G are truncated costs measuring color and gradient differences:

C_I(p, f_p) = \min(\tau_c, |I_r(p) - I_l(p - f_p)|),
C_G(p, f_p) = \min(\tau_g, |G_r(p) - G_l(p - f_p)|).

By integrating the Kinect information, we improve this in two ways. First, we limit the search range of possible disparities at pixel p to a set L_p, a range around the local Kinect estimates. Second, we directly bias the stereo disparities f_p to be similar to the Kinect disparities, which is useful in texture-less regions and in ambiguous regions with a repeating pattern. Thus we define our data cost as

D(p, f_p) = \begin{cases} D_{tr}(p, f_p) + \rho\, C_K(p, f_p), & \text{if } f_p \in L_p, \\ \infty, & \text{otherwise,} \end{cases}   (3)

where C_K(p, f_p) penalizes deviations from the Kinect disparities and \rho is a scalar weight. The Kinect gives good estimates of the disparities at many scene points, although with limited depth resolution, and we alter the search range according to a confidence measure for the Kinect samples.

Kinect based term: The term penalizing deviations of the stereo disparities from the Kinect disparities is defined as C_K(p, f_p) = |f_p - K_u(p)|^2 / (2 S(p)^2), where K_u(p) is the Kinect disparity at stereo image pixel p, and S(p) is a search range obtained using the Kinect and image confidences. The Kinect disparity K_u(p) in the stereo image is obtained as follows. To avoid cluttering the notation, we describe the confidence and cost calculation with respect to the reference image I_r and drop the subscripts. Let K be the sparse disparity map obtained by transferring the valid Kinect depth values to the reference image (see Figure 2(a)), and K_n the corresponding nearest neighbor up-sampled map (see Figure 2(b)). In order to smooth the up-sampled map and align the depth edges more closely with the corresponding image edges, we obtain a map K_u through guided filtering of K_n using the stereo image as the guide [6]. The cost C_K penalizes deviation of disparities from the Kinect suggested value based on the relative confidence of the Kinect and the image matching result.
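A per-pixel sketch of this data cost (Python/NumPy; parameter names follow the paper where they exist, the rest are our placeholders) could look as follows, with K_u the guided-filtered up-sampled Kinect disparity map, and S and the label set L_p computed as described next.

import numpy as np

def data_cost(p, fp, I_l, I_r, G_l, G_r, K_u, S, labels_p,
              alpha=0.5, tau_c=10.0, tau_g=2.0, rho=100.0):
    """D(p, f_p) of Eq. (3): truncated color/gradient matching cost (Eq. 2)
    plus the Kinect deviation term, restricted to the per-pixel set L_p.
    tau_c and tau_g are illustrative truncation values (not given in the paper).
    Assumes x - fp stays inside the image."""
    if fp not in labels_p:
        return np.inf
    y, x = p
    c_i = min(tau_c, abs(float(I_r[y, x]) - float(I_l[y, x - fp])))   # color term
    c_g = min(tau_g, abs(float(G_r[y, x]) - float(G_l[y, x - fp])))   # gradient term
    d_tr = alpha * c_i + (1.0 - alpha) * c_g                          # Eq. (2)
    c_k = (fp - K_u[y, x]) ** 2 / (2.0 * S[y, x] ** 2)                # Kinect term C_K
    return d_tr + rho * c_k                                           # Eq. (3)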
This relative confidence is controlled through the search range S(p) at each pixel, which also controls the set of potential disparities L_p at that pixel. As detailed below, pixels which can be confidently matched through image information are given a larger search range; this ensures we can obtain higher depth resolution than the Kinect in textured areas. On the other hand, pixels in texture-less areas are given a lower S(p), so that there is a larger penalty for deviation from the Kinect disparity.

Disparity Search Ranges and Sensor Confidences: For each pixel, we limit the potential disparities to a range around each sparse Kinect disparity present in a neighborhood of the pixel. The search range at each pixel is calculated as S(p) = max(S_k(p), S_i(p)), where S_k and S_i are derived from the Kinect and image confidences as follows. To obtain the Kinect confidence based search map S_k, we first measure the density of valid values in K over neighborhoods of 20×20 pixels. Due to the sparsity of K, some neighborhoods have few Kinect suggested disparities. In addition, the alignment errors can lead to multiple candidates (foreground and background) at depth boundaries, as illustrated in Figure 2(d). Thus, in regions of both low and high density, we would like stereo matching to test a wider range of labels, and around both foreground and background values at boundaries. We therefore compute the absolute value of the measured density minus its mode over the image, and smooth the result. Given this filtered density map M, we set S_k(p) = \min(m_s, m_s M(p)) with limit parameter m_s.

To estimate the image confidence based search map S_i, we look at the uniqueness of a 5×5 patch. The score is based on the lowest color error of the patch along its scanline: a region with low texture or a repeating pattern has neighboring patches with very low error and thus cannot be confidently matched, so in such regions we limit the number of labels tested around the value suggested by the Kinect. On the other hand, in regions of high (non-repeating) texture, we allow a larger search to extract fine geometry not observed by the Kinect. We derive the color error based score map U as

U(p) = \min_{q \in R(p),\, q \neq p} \sum_{(p+\delta) \in N(p)} |I_r(p+\delta) - I_r(q+\delta)|,

where R(p) = [\max(1, p - m_s), \min(w, p + m_s)] is the set of pixels along the scanline, w is the width of the image, and N(p) is a neighborhood around p. Given the score map U, we calculate S_i(p) = 1 + (1 - e^{-U(p)/\sigma_u})\, m_s.

Given the search map S, we determine the subset L_p \subset L of potential labels for pixel p as the union of intervals around the disparities suggested by the Kinect: L_p = \bigcup_{q \in N(p)} [K(q) - S(p), K(q) + S(p)]. Based on the scene depth, image resolution, sparsity, and the alignment error discussed before, we use a neighborhood N of 20×20 pixels. For pixels with no Kinect data we test all disparities. Figure 3(a) shows the search map for the sample scene from Figure 1. Note how the ambiguous regions have a lower search range compared to edges and textured regions.
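The two search maps can be sketched roughly as follows (Python with NumPy/SciPy; the histogram-based mode estimate and the unoptimized patch search are our simplifications of the description above, not the authors' code).

import numpy as np
from scipy.ndimage import uniform_filter

def kinect_search_map(K_sparse, m_s=30, win=20):
    """S_k and the filtered density map M: the density of valid Kinect samples
    in win x win neighborhoods, its absolute deviation from the mode over the
    image (smoothed), and S_k(p) = min(m_s, m_s * M(p))."""
    valid = np.isfinite(K_sparse).astype(float)
    density = uniform_filter(valid, size=win)
    hist, edges = np.histogram(density, bins=50)          # crude mode estimate
    mode = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    M = uniform_filter(np.abs(density - mode), size=win)
    return np.minimum(m_s, m_s * M), M

def image_search_map(I_gray, m_s=30, sigma_u=50.0, patch=5):
    """S_i from the uniqueness of a patch along its scanline:
    S_i(p) = 1 + (1 - exp(-U(p) / sigma_u)) * m_s."""
    I = I_gray.astype(float)
    H, W = I.shape
    r = patch // 2
    S_i = np.ones((H, W))
    for y in range(r, H - r):
        for x in range(r, W - r):
            ref = I[y - r:y + r + 1, x - r:x + r + 1]
            best = np.inf
            for q in range(max(r, x - m_s), min(W - r, x + m_s + 1)):
                if q == x:
                    continue
                err = np.abs(ref - I[y - r:y + r + 1, q - r:q + r + 1]).sum()
                best = min(best, err)
            S_i[y, x] = 1.0 + (1.0 - np.exp(-best / sigma_u)) * m_s
    return S_i

# S(p) = max(S_k(p), S_i(p)); L_p is then the union of the intervals
# [K(q) - S(p), K(q) + S(p)] over the Kinect samples q in a 20x20 neighborhood.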

Figure 3. The different maps derived from the Kinect and stereo image information for the data and smoothness cost estimation: (a) search map S, (b) smoothness map λ, (c) traditional smoothness cost, (d) our smoothness cost. The display mapping is black (low) to white (high).

3.3. Smoothness Cost

The smoothness term V(p, q, f_p, f_q) has traditionally been derived from the color similarity of neighboring pixels:

V_{tr}(p, q, f_p, f_q) = \lambda \, |f_p - f_q| \, e^{-|I_r(p) - I_r(q)|/\sigma_i}.   (4)

This is based on the assumption that depth discontinuities coincide with image edges; the converse of this heuristic, however, is not necessarily true. The Kinect provides regions of discontinuity, and also indicates areas with smooth depth gradients. We use this information in two ways. First, we derive the gradients from a filtered and up-sampled depth map from the Kinect. This eliminates image gradients which do not correspond to depth gradients and allows better label transitions. Second, the regions with discontinuities are also indicative of occlusion; since our aim is to obtain a dense and smooth depth map, we allow the smoothness term to have a larger weight in such regions by using a spatially varying weight. Thus we define our smoothness term as

V(p, q, f_p, f_q) = \lambda(p) \, |f_p - f_q| \, e^{-|J(p) - J(q)|/\sigma_s}.   (5)

We obtain the spatially varying smoothness weight \lambda(p) and the map J used to guide the label gradients as follows. Let M_{[a,b]} denote the filtered Kinect density map M (detailed in the previous section) normalized to the range [a, b]. To recap, this map M indicates regions where the Kinect values can be trusted (lower values) and those which were ambiguous to the Kinect (higher values). We calculate J(p) = M_{[0,1]}(p) I_{rg}(p) + (1 - M_{[0,1]}(p)) K_u(p), where I_{rg} is the grayscale version of I_r.
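As a small illustration (NumPy; how the grayscale image and the disparity map are brought to a comparable numeric range is not specified in the paper, so that is left to the caller), the map J and the per-edge factor e^{-|J(p)-J(q)|/\sigma_s} that Figure 3(d) visualizes can be computed as:

import numpy as np

def guidance_map_and_edge_weights(M01, I_rg, K_u, sigma_s=10.0):
    """J(p) = M01(p) * I_rg(p) + (1 - M01(p)) * K_u(p), followed by the
    neighbor factors exp(-|J(p) - J(q)| / sigma_s) used in Eq. (5).

    M01  : filtered Kinect density map normalized to [0, 1]
           (small where the Kinect is trusted, large where it was ambiguous)
    I_rg : grayscale reference image
    K_u  : guided-filtered, up-sampled Kinect disparity map
    """
    J = M01 * I_rg + (1.0 - M01) * K_u
    w_h = np.exp(-np.abs(J[:, 1:] - J[:, :-1]) / sigma_s)   # horizontal edges
    w_v = np.exp(-np.abs(J[1:, :] - J[:-1, :]) / sigma_s)   # vertical edges
    return J, w_h, w_v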
This map is used to derive the smoothness cost from the gradients of the low resolution scene depth, after suitable up-sampling and filtering, wherever the Kinect information can be used confidently; in other regions it falls back on the image information, as in the traditional formulation. Figure 3(c-d) demonstrates the advantage of our map compared to the traditional cost derived from image gradients. As can be observed in Figure 3(c) (a plot of e^{-|I_r(p) - I_r(q)|/\sigma_i}), various strong image edges are present within geometrically smooth regions, such as the chessboard pattern and the textures on the cloths. In contrast, our processing retains only the depth edges and suppresses image edges which do not correspond to depth gradients, as shown by the plot of e^{-|J(p) - J(q)|/\sigma_s} in Figure 3(d). More importantly, the use of the low resolution depth from the Kinect in the component K_u provides depth gradients in texture-less areas such as the single colored boards.

Traditionally, the weight \lambda is a fixed scalar, but we observed that different regions benefit from different relative contributions of the data and smoothness costs. In regions where the image texture is found to be confident, the data cost should contribute more. Regions near depth edges, where occlusion and alignment errors are present, benefit from a larger smoothness weight. Since the search range and data cost have already been tuned to balance the image and Kinect confidences, we handle the latter factor in the smoothness weight map, by mapping regions of low Kinect confidence to a larger smoothness weight. Thus \lambda = M_{[s_1, s_2]}, where s_1 and s_2 are the desired low and high smoothness weights. Figure 3(b) shows the map corresponding to the sample scene.

Parameters: For all our experiments \sigma_u = 50, m_s = 30, \alpha = 0.5, \rho = 100, \sigma_s = 10, s_1 = 10 and s_2 = 100.
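Putting these pieces together, here is a sketch of the spatially varying weight and of the final smoothness cost in Eq. (5), with the parameter values listed above (the linear rescaling of M to [s_1, s_2] is our reading of the normalization the paper only names):

import numpy as np

def normalize_to_range(M, lo=10.0, hi=100.0):
    """lambda = M_[s1, s2]: linearly rescale the filtered Kinect density map M
    so that low Kinect confidence (large M) gets a large smoothness weight."""
    Mn = (M - M.min()) / max(M.max() - M.min(), 1e-12)
    return lo + (hi - lo) * Mn

def smoothness_cost(p, q, fp, fq, J, lam, sigma_s=10.0):
    """V(p, q, f_p, f_q) = lambda(p) * |f_p - f_q| * exp(-|J(p) - J(q)| / sigma_s)."""
    return lam[p] * abs(fp - fq) * np.exp(-abs(J[p] - J[q]) / sigma_s)

This smoothness_cost has the same (p, q, f_p, f_q) signature as the pairwise callable in the energy sketch of Section 3, so the two plug together directly once J and lam are bound (for example with functools.partial).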

4. Experiments and results

We present various qualitative results in this section; quantitative comparisons are given in the supplementary file (footnote 1: PUT WEBSITE LINK). All images are best viewed on a computer monitor at high zoom, or as the high resolution images in the supplementary file. Alpha expansion on the traditional stereo formulation took several hours to converge, while our fusion formulation completes in 10-20 minutes. Recent fusion methods using global optimization, such as [12], were only demonstrated on images resized to 2 MP due to their high computational needs, and thus cannot be run on our 4-12 MP images. The latest local methods such as [7], though scalable, cannot solve the problem of ambiguity in texture-less or repeating texture regions, as shown in our comparisons; patch-based matching cannot fully overcome the matching ambiguity either.

Figure 4. Comparison of results from different schemes on the sample scene from Figure 1: (a) greedy on the traditional DC, (b) greedy on our DC, (c) greedy on the filtered DC [7], (d) joint bilateral up-sampling, (e) traditional GC, (f) our result. Here DC stands for data cost and GC stands for Graph Cuts optimization.

Figure 5. Two views of the 3D point cloud reconstructions of the Santa figurine from our results (left) and Kinect nearest neighbor up-sampling (right). The reconstruction shown here is part of the Santa scene shown in Figure 6.

We start with the sample scene from Figure 1 to illustrate the differences between the various schemes, since it contains surfaces where the two sensors behave differently. We use integer disparities, and the sample scene contains 300 labels for the cropped image size of 1700×2500 pixels after rectification (cropped to exclude boundaries and blank areas introduced by the rectification warping). Figure 4 compares the different schemes. The greedy approach assigns to each pixel the label with the minimum data cost. Figure 4(a) shows the greedy algorithm applied to the cost volume from the traditional scheme (Eqn. 2), and Figure 4(b) shows the greedy answer from the proposed data cost (Eqn. 3). We can observe that this already improves the result in texture-less regions such as the green board, where Kinect information was available. However, ambiguous regions of the wall with no Kinect information still suffer from noisy and incorrect labeling. Recently, [7] proposed cost volume filtering using the image as a guide [6]. Though the method has been used successfully to recover sharp depth discontinuities and small structures, it does not address the problem in texture-less areas, and, as discussed before, image edges not corresponding to depth discontinuities can lead to incorrect geometry; this is demonstrated in Figure 4(c). Joint bilateral up-sampling of the transferred Kinect depth also suffers from some of the above problems, as shown in Figure 4(d). Due to the sparsity of the Kinect samples, we must use a large sigma and range for the bilateral filtering, which in turn leads to over-smoothing in some regions.
The mixed depth values at boundaries, discussed before, lead to further errors. Since up-sampling methods do not allow preferring one value over another, such errors, and those resulting from sensor noise, cannot be fully corrected. We tested different parameter values and show the best results.

A global optimization can overcome some of these problems. In Figure 4(e) we show the results from traditional stereo. Though noise is reduced, the repeating texture of the chessboard and the lack of gradients on the single colored boards lead to incorrect labeling; for example, the tilt of the green board is not captured. As shown in Figure 4(f), our result overcomes the problems discussed above. Note the correct depth gradients in the texture-less regions, the correct depth edges compared to the up-sampled result, and the low noise compared to the greedy approaches.

We illustrate the recovery of fine structures in Figure 6. Rows 1-2 show parts of the sample scene; note the correct recovery of the indentation on the right side of the gasoline can, and the fine details of the scarf on the teddy. Rows 3-4 show results from the Santa scene. Observe that our results provide better gradients on the two textured cloths: they were pinned at two ends, resulting in a U-like depression, which is better recovered both in our results and in traditional stereo than in the Kinect and its up-sampled results. Rows 5-6 show a scene with a specular, concave ceramic bowl. Note the correct recovery of the concave shape on the top half of the bowl compared to the traditional stereo formulation; the thin edges of the bowl and the folds of the cloth below it are recovered correctly, in contrast to the Kinect and up-sampled results. In row 7, note the correct disparity for the repeating pattern (chessboards) compared to traditional stereo. Our results also recover better depth boundaries than the up-sampling schemes. Figure 5 shows 3D reconstructions of the Santa figurine from our method and from Kinect up-sampling. The black arrows point to some regions where our method captured better geometry, for example the spring structure in the mid-center of Santa, the correct recovery of the top of the head and the hand, and more levels in the face and nose area. These comparisons illustrate that the Kinect alone, or its up-sampled output, is not sufficient to obtain good correspondences at the resolution of the stereo images, which is the goal of our work. It can, however, be used effectively in regions where traditional stereo matching fails.

5. Conclusion

In this paper we showed the effective fusion of Kinect and stereo sensors to obtain high quality correspondences at the spatial and depth resolution of the stereo images. We proposed novel data and smoothness costs for a global optimization framework that combines depth information from the Kinect with stereo matching. Compared to the traditional stereo formulation, the optimization was modified by limiting the number of potential disparity labels at each pixel given the Kinect depth estimate, biasing the final disparity toward the Kinect-suggested geometry, and encouraging depth discontinuities to align with the Kinect data. Our results show a gain in depth resolution, recovery of fine geometry, and correct depths in ambiguous regions, thus overcoming the weaknesses of both sensors through effective fusion. A potential future direction is to fit low dimensional parametric models, such as planes or spheres, to parts of the Kinect data, and use them to enforce a stronger geometry/gradient prior in the smoothness term.

References
[1] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11):1222-1239, 2001.
[2] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-of-flight camera. In CVPR, 2010.
[3] J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In NIPS, 2005.
[4] A. Fusiello and L. Irsara. Quasi-Euclidean uncalibrated epipolar rectification. In ICPR, 2008.
[5] U. Hahne and M. Alexa. Depth imaging by combining time-of-flight and on-demand stereo. In DAGM Workshop on Dynamic 3D Imaging, 2009.
[6] K. He, J. Sun, and X. Tang. Guided image filtering. In ECCV, 2010.
[7] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. PAMI, 2012.
[8] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.
[9] J. Park, H. Kim, Y.-W. Tai, M. Brown, and I. Kweon. High quality depth map upsampling for 3D-ToF cameras. In ICCV, 2011.
[10] S. Schuon, C. Theobalt, J. Davis, and S. Thrun. LidarBoost: Depth superresolution for ToF 3D shape scanning. In CVPR, 2009.
[11] K. H. Strobl, W. Sepp, S. Fuchs, C. Paredes, M. Smisek, and K. Arbter. DLR CalDe and DLR CalLab.
[12] L. Wang and R. Yang. Global stereo matching leveraged by sparse ground control points. In CVPR, 2011.
[13] Q. Yang, K.-H. Tan, W. B. Culbertson, and J. G. Apostolopoulos. Fusion of active and passive sensors for fast 3D capture. In IEEE International Workshop on Multimedia Signal Processing, 2010.
[14] Q. Yang, R. Yang, J. Davis, and D. Nister. Spatial-depth super resolution for range images. In CVPR, 2007.
[15] J. Zhu, L. Wang, J. Gao, and R. Yang. Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs. PAMI, 2010.

Figure 6. Results and comparisons on different scenes (columns: input images, our result, traditional stereo, nearest neighbor up-sampling, joint bilateral up-sampling). The scenes were approximately 2-3 meters from the cameras, and the number of disparity labels ranges from 200 to 500. The images are best viewed on a computer screen. Rows 1-2: details of structures recovered in the sample scene of Figures 1 and 4. Rows 3-4: results and details for the Santa scene; 3D reconstructions are shown in Figure 5. Rows 5-6: results and details for the bowl scene. Row 7: results for the room scene.