Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching

CS 69 Class Project Report

Junhua Mao and Lunbo Xu
University of California, Los Angeles
mjhustc@ucla.edu and lunbo xu@cs.ucla.edu

Abstract

In this report, we present a system that automatically detects target objects in an image and reconstructs them in 3D. After the reconstruction, users can edit the image by scaling, translating, rotating and deleting the target objects. The system consists of three parts. The first is the detection part. We use HOG (Histogram of Oriented Gradients) template matching for detection and adopt a sliding-window strategy over multiple scales and positions. We allow a small deformation of the HOG template, and non-maximum suppression is applied to remove duplicate detection windows. The second is the 3D transformable model matching part, in which we search for the 3D model parameters that best match the edge maps of the detected target objects. The third is the image editing part, in which users can rotate, translate and scale multiple detected target objects using simple mouse operations. The whole system reduces the effort required of the user, especially when the user needs to reconstruct near-duplicate objects in multiple images.

1. Introduction

3D modeling of objects in an image has been an active research topic in the computer graphics community for several years. One limitation of these 3D modeling algorithms is that they generally require the user to supply information about the objects in the image. This takes a lot of time and effort when the user wants to reconstruct multiple duplicated objects in several images. At the same time, object detection has been widely studied in the computer vision community. Our motivation for this project is that state-of-the-art object detection algorithms can be used to find and reconstruct the objects the user wants automatically.
In this section, we briefly review recent progress in object detection and 3D modeling. We then introduce the goal of our system, its three components, and the detailed steps to build it.

Many image features have been proposed in recent years, such as HOG (Histogram of Oriented Gradients) features [2] and SIFT (Scale-Invariant Feature Transform) features [5]. Both use the edge information in the image and use histograms to achieve invariance to image conditions. Many advanced methods are based on these two features. Specifically, the Deformable Part-Based Model [3] uses several part filters and a root filter to detect objects in images. It extracts HOG features at several scales and uses them as the image features for the part filters and the root filter. It achieved state-of-the-art results on several benchmark object detection datasets, such as Pascal VOC. Most recently, deep convolutional neural networks have reported large performance improvements over traditional computer vision algorithms when given millions of training images [4].

As for 3D modeling, traditional interfaces rely on tedious mouse-based clicking and dragging, but great advances in usability have been made in recent years as the focus has shifted toward improving interaction. Sketch-based interfaces for modelling (SBIM) are a promising trend in user interaction. Extracting three-dimensional objects from a single photo is still a great challenge [6]. With the assistance of user-drawn sketches, several impressive works have been done. We list some of them here. Olsen and Samavati's Image-Assisted Modelling system [6] requires the user to sketch object boundaries and features on a single image. Shtof et al. [7] reconstruct 3D shapes while simultaneously inferring global mutual geosemantic relations between their parts.
The most impressive one, 3-Sweep [1], also needs the user's help to implicitly decompose the object into simple, semantically meaningful parts. None of the previous 3D reconstruction work operates on a single image and, at the same time, without the user's help.

On the basis of these methods in computer vision and graphics, we want to build a system that can:

1. Detect the locations of the objects we want to find;
2. Match the 3D transformable object model to the detected objects;
3. Allow the users to edit the image by rotating, scaling and translating the objects.

The inputs from the user are a single template image of the object they want to find and its template 3D model. To achieve this goal, our system consists of three parts: the detection part (see Section 2), the 3D reconstruction part (see Section 3) and the image editing part (see Section 4). For the detection part, we need to (1) extract image features of the template image; (2) detect the objects in the testing images; (3) estimate the initial 3D model of the detected objects; (4) remove the detected objects to get the background images. In the 3D reconstruction part, the system matches the transformable model to the edge maps of the detected objects and reconstructs them in 3D. Finally, in the image editing part, we design a user interface that helps the user edit the image. We introduce these parts in detail below.

2. Object Detection Part

2.1. Image features

There are some differences between traditional object detection systems and our system. Traditional object detection systems typically have many training images and allow the objects to have large visual variance across images. In contrast, our system has only one training image (the template image), and we want to find near-duplicate objects in the testing images. Therefore, we choose a HOG template as the feature. For the template image, the HOG feature size is 30 pixels. The feature dimension of one HOG cell is 31. Please see Figure 1 for an example. To further increase the discriminative power of our feature descriptor, we also extract a color histogram. Specifically, we convert the images to the HSV color space. The Hue, Saturation and Value channels are quantized into 8 bins, 4 bins and 2 bins respectively. This leads to a feature of dimension 64 (8 x 4 x 2).

Figure 1. Example of the template image and the visualization of its HOG template.

2.2. Feature matching

Because the size and position of the objects in the testing images are arbitrary, we propose a deformable template feature matching method with a sliding-window strategy to detect the objects in the testing images. Specifically, we extract features at multiple scales (20 scales in total) and use a sliding-window approach to exhaustively search every position at every scale. We also allow a small deformation of the HOG template. We adopt Normalized Cross-Correlation as the similarity measure. Figure 2 shows a sample testing image and the visualization of its HOG features at several scales.

Figure 2. Example testing image and visualization of its HOG features at several scales (panels: sample test image, HOG scale 1, (c) HOG scale 9, (d) HOG scale 18).

After the sliding-window search, we obtain many candidate object bounding boxes (see Figure 3). Many of them are duplicates, and we need to keep the most accurate ones. Therefore, Non-Maximum Suppression is applied among the candidate bounding boxes that have an IoU (Intersection-over-Union) ratio larger than 0.5 with each other. The IoU ratio between two candidate bounding boxes is calculated as the area of the intersection region of the two bounding boxes divided by the area of their union region. We get the final detection results after Non-Maximum Suppression (see Figure 3). In Figure 4, we show more detection results. Our system can successfully detect near-duplicate objects against complex backgrounds and under different image conditions.

Figure 3. Candidate object bounding boxes after sliding window search. Detection results after applying Non-Maximum Suppression.

Figure 4. More detection results of our system under different image conditions and complex background.

2.3. Interface of the first two parts

The output of the first part (detection part) serves as the input of the second part (3D reconstruction part). The output includes:

- Original testing image
- Image without target objects (background image). We use Partial Differential Equation based image inpainting to refill the detected bounding box regions.
- Edge map for each detected object
- Distance map from every pixel to its nearest edge pixel. We use a distance transform to calculate the distance map, which has O(WH) time complexity, where W and H are the width and height of the testing image. This step is important for accelerating the optimization step in the 3D reconstruction part.
- Texture of each detected object
- Estimated initial 3D model parameters (e.g. the radii of the bottom and top circles of the cup, and the orientation)

Example outputs of the first part are shown in Figure 5 (the distance map and the 3D model parameters are not shown).

Figure 5. Example of the output of the detection part (panels: original image, background, (c) edge maps and texture).

3. 3D Reconstruction Part

In this section, we introduce how we use the output of the detection part to reconstruct the 3D cup models, project them onto the background image at the estimated positions, and optimize their positions.

3.1. Modeling

We model a cup with 4 faces (outside wall, inside wall, outside bottom and inside bottom) and apply flat shading to it (Figure 6). The texture of each cup is extracted in the detection part and is applied only to the outside wall of the cup. The remaining faces of the cup are simply painted white. In order to control a cup model, we need parameters that specify not only its shape and coordinates, but also its orientation. Therefore, we control the model by the following seven parameters (Figure 6):

- $r_0$: the radius of the bottom circle face;
- $r_1$: the radius of the top circle face;
- $(x_0, y_0)$: the centre's coordinates on the drawing plane;
- $\langle x, y, z \rangle$: a vector specifying the orientation of the cup, whose length also gives the height of the cup, $h = \sqrt{x^2 + y^2 + z^2}$.

Figure 6. The directions of the normals on the 4 faces indicate that the cup is rendered with flat shading. The model is controlled by 7 parameters.
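The seven parameters that control the cup model can be gathered in one small structure. The following is only an illustrative sketch: the class name `CupParams` and its field names are hypothetical and do not come from our implementation, and it assumes the cup height is the Euclidean length of the orientation vector.

```python
import math
from dataclasses import dataclass


@dataclass
class CupParams:
    """Hypothetical container for the seven cup parameters."""
    r0: float  # radius of the bottom circle face
    r1: float  # radius of the top circle face
    x0: float  # centre x coordinate on the drawing plane
    y0: float  # centre y coordinate on the drawing plane
    ax: float  # orientation vector, x component
    ay: float  # orientation vector, y component
    az: float  # orientation vector, z component

    @property
    def height(self) -> float:
        # Assumption: the height is the length of the orientation vector.
        return math.sqrt(self.ax ** 2 + self.ay ** 2 + self.az ** 2)

    def unit_axis(self) -> tuple:
        # Direction of the cup axis with the height factored out.
        h = self.height
        if h == 0.0:
            raise ValueError("orientation vector must be non-zero")
        return (self.ax / h, self.ay / h, self.az / h)


# Example values are made up for illustration.
cup = CupParams(r0=30.0, r1=40.0, x0=0.0, y0=0.0, ax=0.0, ay=3.0, az=4.0)
print(cup.height, cup.unit_axis())
```

Storing the orientation and the height in a single vector keeps the parameter count at seven: scaling the vector changes the height, and rotating it changes the orientation.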
Figure 9. Representative points are projected on the edge map image to calculate a quadratic energy function.

3.2. Perspective Projection

We use perspective projection (Figure 7) to project the cup models onto the background image. We draw the background image on the far plane and set the size of the far plane equal to the size of the background image. We then put the cups' centres on the far/2 plane so that the cups are not occluded by anything. The position of the far plane is determined by the height of the background image $H$ and fovy (i.e. the field of view in the y direction):

$$Z_{far} = \tan\left(90^\circ - \frac{fovy}{2}\right) \cdot \frac{H}{2}$$

For the near plane, we simply set $Z_{near} := 1$. Together with the parameter $(x_0, y_0)$, the 3D coordinate of the centre of a cup in the World Coordinate System is $(x_0, y_0, Z_{far}/2)$.

Figure 7. Perspective projection: background image at the far plane and cups' centres at the far/2 plane.

3.3. Estimation

The cup model is controlled by seven parameters. We obtain their values in two steps: estimation and optimization. Good initial estimates ease the burden of optimization no matter which optimization method is used. Recall that for each cup, the output of the detection part includes a bounding box and a template ID. The parameter $(x_0, y_0)$, which determines the position of the model, is simply initialized to the center of the bounding box:

$$x_0 = x_b + \frac{w_b}{2} - \frac{W}{2}, \qquad y_0 = y_b + \frac{h_b}{2} - \frac{H}{2}$$

where $(x_b, y_b)$ is the coordinate of the lower-left corner of the bounding box, $(w_b, h_b)$ is the width and height of the bounding box, and $(W, H)$ is the size of the background image. The estimated initial values should all be divided by two, since the cup models are drawn on the far/2 plane. The other parameters $r_0$, $r_1$ and $\langle x, y, z \rangle$, which determine the shape and the orientation of the model, are initialized with preset values according to the template ID.
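The far-plane distance and the initial position estimate can be checked numerically with a short sketch. It assumes the far-plane distance is read as Z_far = tan(90° - fovy/2) · H/2 and that the bounding box center is expressed relative to the image center (x_b + w_b/2 - W/2, and likewise for y) before being halved for the far/2 plane. The function names and the example numbers are illustrative and are not taken from our implementation.

```python
import math


def far_plane_distance(image_height: float, fovy_deg: float) -> float:
    """Distance at which a plane of height `image_height` exactly fills
    a vertical field of view of `fovy_deg` degrees (assumed reading)."""
    return math.tan(math.radians(90.0 - fovy_deg / 2.0)) * image_height / 2.0


def initial_center(xb, yb, wb, hb, W, H):
    """Initial (x0, y0) from a bounding box with lower-left corner (xb, yb).

    Assumption: coordinates are measured from the image center, then
    divided by two because the cup is drawn on the far/2 plane.
    """
    cx = xb + wb / 2.0 - W / 2.0
    cy = yb + hb / 2.0 - H / 2.0
    return cx / 2.0, cy / 2.0


# Made-up example: a 640 x 480 background and a 40 x 60 bounding box.
z_far = far_plane_distance(image_height=480, fovy_deg=90.0)
x0, y0 = initial_center(xb=100, yb=50, wb=40, hb=60, W=640, H=480)
print(z_far, (x0, y0))
```

With a 90° field of view the far plane sits at a distance of half the image height, which is a convenient sanity check for the projection setup.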
We believe this is a naive yet effective way to estimate the shape and the orientation, since an object is assigned to a particular template for a reason and the number of templates is limited. However, different objects with the same template ID share the same estimated shape and orientation (Figure 8). It is the task of optimization to differentiate them.

Figure 8. Comparison between the original image, the result after estimation, and the result after optimization (panel labels: Estimated, Original, (c) Optimized).

3.4. Optimization

The goal of optimization is to make the projected cup fit the original object well. Specifically, we want to select the parameters that minimize the quadratic energy function

$$E = \sum_i \left\| (x_i, y_i) - (x_{ei}, y_{ei}) \right\|^2,$$

where $(x_i, y_i)$ are the coordinates of the projected representative points on the cup model (we select 3+16+8+8 representative points; see Figure 9 for more details) and $(x_{ei}, y_{ei})$ is the nearest edge point (Figure 9) to the projected point $(x_i, y_i)$. We discuss two algorithms for minimizing this energy function below.

3.4.1 Naive algorithm

So far we have implemented only a naive algorithm, which simply enumerates each parameter. Specifically, each parameter has seven offset choices, from $-3\delta$ to $+3\delta$. For instance, for parameter $r_0$ we try $r_0 - 3\delta_{r_0}$, $r_0 - 2\delta_{r_0}$, $r_0 - \delta_{r_0}$, $r_0$, $r_0 + \delta_{r_0}$, $r_0 + 2\delta_{r_0}$ and $r_0 + 3\delta_{r_0}$, where $r_0$ is the estimated initial value and $\delta_{r_0}$ is determined by the size of the bounding box. For each set of parameters $\theta = (x_0, y_0, r_0, r_1, x, y, z)$, we can calculate $E(\theta)$ by

$$E = \sum_i \left\| (x_i, y_i) - (x_{ei}, y_{ei}) \right\|^2 = \sum_i dis[x_i][y_i]^2,$$

where $dis[x_i][y_i]$ is the distance from the projected point $(x_i, y_i)$ to its nearest edge point $(x_{ei}, y_{ei})$, which is precalculated by the detection part. Therefore, the time complexity of this algorithm is $O(WH)$ for the distance map calculation and $O(7^7 \cdot (3 + 16 + 8 + 8))$ for the enumeration. The bottleneck is the number of choices for each parameter during enumeration. Compared with the estimated one (Figure 8), the result of this optimization algorithm (Figure 8(c)) fits the original (Figure 8) better: the left cup gets bigger, the middle one gets smaller, and the right one is skewed a little.

3.4.2 EM algorithm

We could use an Expectation-Maximization (EM) algorithm instead of the naive one to optimize the parameters. Plain gradient descent is not directly applicable, since the nearest edge point of each projected point changes as the parameters are updated. However, we can use an EM algorithm that applies gradient descent in each iteration:

    Initialize θ by the estimated value (x0, y0, r0, r1, x, y, z)
    while (E(θ) > threshold) {
        (x_i, y_i) := (f_xi(θ), f_yi(θ)), for all i;
        Find each point's nearest edge point;
        Formulate E(θ);
        Apply Gradient Descent;
    }

Although we have not implemented it in our system yet, we believe this guided optimization algorithm will lead to better results. It is part of our future work.

4. Image Editing Part

After reconstruction, we can edit the image through the user interface. Image editing in our system simply means changing the cup models' parameters. The user can translate a cup by right-clicking the target cup and dragging it, rotate it by left-clicking the target cup, and scale it by clicking the left and right buttons together. Some sample results are shown in Figure 10.

Figure 10. Image editing results.

References

[1] T. Chen, Z. Zhu, A. Shamir, S.-M. Hu, and D. Cohen-Or. 3-Sweep: Extracting editable objects from a single photo. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013), 32(6):Article 195, 2013.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627-1645, 2010.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 1, page 4, 2012.
[5] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150-1157. IEEE, 1999.
[6] L. Olsen and F. F. Samavati. Image-assisted modeling from sketches. In Proceedings of Graphics Interface 2010, GI '10, pages 225-232, Toronto, Ont., Canada, 2010. Canadian Information Processing Society.
[7] A. Shtof, A. Agathos, Y. Gingold, A. Shamir, and D. Cohen-Or. Geosemantic snapping for sketch-based modeling. Computer Graphics Forum, 32(2):245-253, 2013. Proceedings of Eurographics 2013.