Find Me: An Android App for Measurement and Localization

Transcription

1 Find Me: An Android App for Measurement and Localization Grace Tsai and Sairam Sundaresan Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor Abstract We present a novel method to compute the 3D position of any point in a scene up to scale, measure objects within the scene and also localize the user in a scene using an android phone. Our method uses simple geometry and loose assumptions to compute the 3D configuration of the scene. We do not rely on Manhattan world assumptions or any other similar constraints about the structure of the scene. Using this application, the user can measure objects in the scene, or measure the scene itself. Experiments and results show that our method has very high accuracy and speed, and is ideal for the mobile platform. I. INTRODUCTION There has been a lot of work recently in the area of mobile computer vision. Various computer vision algorithms have been implemented on portable devices such as mobile phones to perform complex tasks such as object recognition, object tracking, image matching, scene understanding, etc. These algorithms have been used for barcode scanning, mobile shopping, location recognition and so on. There has however not been a lot of work done towards measuring a room or a scene using a mobile phone. More precisely, image understanding has not been extended towards measuring objects in the scene or the dimensions of the scene itself. We believe that there is a lot of untapped potential here, particularly because of the numerous applications in this area. Our method can be used by physically challenged people for navigation in a wheelchair. Physically challenged people would like to have the ability to navigate and move seamlessly while on a wheelchair. Modern electronic wheelchairs have a lot of controls on them, hence making them cumbersome for the people who use them. Instead if the person was able to select where he or she wanted to go on the phone, the phone could then return the coordinates of that location, making it easier to provide input to the wheelchair. Augmented reality has gained a lot of popularity recently. We believe that our app can also be used in such applications. For example, for playing games like squash, ping-pong etc between two android phones. When parking space is hard to find, under such cramped conditions, it becomes even more hard to judge whether your car can fit between two cars. A person can use this app to determine whether there is sufficient space to park between two cars in a parking lot. This can also be used to reconstruct a 3D model of the environment for applications such as virtual tours etc. This algorithm can also be used to measure the room and furniture inside it. For example, if a person has just moved into a house or wants to get some new furniture, he/she may not be sure of how much space is available. Using this app, he/she can measure the room up to scale, and then at the store measure the furniture up to the same scale and decide if it will fit nicely or not. The main motivation behind this project was to be able to develop a simple method which could provide meaningful information to the user from a single image and help him/her make informed decisions using the information. A. Previous Work II. RELATED WORK There are three main areas which are related to our work. These areas can be classified as single image methods, Visual SLAM and Structure from motion. Both Stucture from Motion and Visual SLAM techniques make use of a stream of images to produce a model of the scene in the form of a dense 3D point cloud. These methods produce very high quality models of the scene, but there are two issues which arise when such techniques need to be implemented on a mobile platform. The first issue is speed. These methods are computationally intensive which makes them difficult to implement on a phone. Even in the case of a client server model, where the server does the main computation, these methods would still suffer from a severe bottleneck in terms of computation time. The second issue is bandwidth available, i.e., that of sending a video to the server. Both the methods mentioned above require a large number of frames for building a good model of the scene. This in turn would mean that the length of the video recorded has to be large enough. Sending large files over to the server and back would have a crippling effect on the efficiency of the application in terms of time. The other popular technique used is that of Single image based reconstruction [6], [8], [3], [9], [12], [5], [10], [14]. Single image methods have the advantage that only a single image needs to be sent from the client to server. This allows for large savings in time, and also in terms of bandwidth. Saxena et.al used Markov Random Fields to identify the orientation and 3D positions of super pixels in the image and then reconstructed a 3D model from it. Hebert et.al have also done a lot of work in this area. In [7], they generate a coarse geometric model of the scene which divides the scene into three regions, namely ground, sky and surfaces. In [4], they develop a qualitative model of the scene based on blocks.

2 Hoiem et.al [5], generated a box layout of indoor scenes which had a significant amount of clutter in them. The common theme in single image approaches is that of learning a mapping from image features to depth, or surface orientation, etc. As a result, most of these approaches require extensive training and are also computationally expensive. In the case of mobile platforms, these algorithms would not be able to run in near real time speed, and would also impose severe bandwidth constraints. B. Contributions We propose a simple method that exploits geometric properties of the scene and does not require training. Also, we wish to encourage user interaction with the algorithm, which we believe will enhance user experience, and will also provide solutions which are user friendly. Furthermore, our method requires no prior training and this enables our technique to generalize well over different environments. We also do not impose hard constraints on the structure of the scene such as the Manhattan World Assumptions. We only require that the walls be perpendicular to the ground plane. Our method borrows assumptions from the paper by Delage et.al [3], [2]. The rest of the paper is organized as follows, section III deals with the technical details of our algorithm, in section IV, we present some experimental results and section V contains the conclusions. A. Overview III. TECHNICAL DETAILS The whole system is divided into client and server subsystems. The majority of the computation is done on the server. The system diagram shown in Figure 1 and 2 explains the split up of processing as well as the overall process flow. There are two modes of operation, namely single image mode and two image mode. In the single image mode, the user can find out the 3D coordinates of any point in the scene from a single image. For this, the user takes an image, and sends the image to the server. The server then generates various ground wall boundary hypotheses for the image received. These hypotheses are then superimposed on the image and sent back to the user. The user can then choose which hypothesis is the best and send his/her choice to the server. The server then sends back the data pertaining to that hypothesis to the client. Finally, 3D computation is done interactively where the user can specify points on the screen and then have the client compute the 3D coordinates of those points. In the two image mode, a similar procedure is followed where the user takes the first image, sends it to the server. The server sends back the various ground wall boundary hypotheses to the client. The user picks the best hypothesis and the choice is sent to the server. A similar procedure is adopted once the second image is taken by the user. Once both images have been processed, the server computes the motion between the two images using SURF feature correspondences. The computed motion results are then sent to the client. Fig. 1. Process Flow for Single Image Mode. The user takes an image, and sends the image to the server. The server generates various ground wall boundary hypotheses for the image received and send back an selection image with all the hypotheses superimposed on the original image. The user choosees which hypothesis is the best and send the choice to the server and the server sends back the data pertaining to that hypothesis to the client. Finally, 3D computation is done interactively where the user can specify points on the screen and then have the client compute the 3D coordinates of those points. Fig. 2. Process Flow for Two Image Mode. A similar procedure as the single image mode is followed where the user takes the two image, sends it to the server. The server sends back the various ground wall boundary hypotheses to the client. The user picks the best hypothesis and the choice is sent to the server. Once both images have been processed, the server computes the motion between the two images using SURF feature correspondences. The computed motion results are then sent to the client. The client server split up is as follows: The client captures the images and computes the 3D positions of points in the scene in the Single Image mode. The Server generates the hypotheses, and also computes the motion in the Two Image mode. Figures 1 and 2 explain the process in more detail. B. Ground-Wall Boundary Hypotheses A ground-wall boundary hypothesis is defined as a polyline extending from the left to the right boundaries of the image (Figure 3(a)), similar to [3]. The area enclosed by the polyline defines the ground plane (Figure 3(b)) and a line segment defines a wall plane in the physical world (Figure 3(c)). A ground-wall boundary hypothesis can be generated by connecting line segments to bound a region in the image space. Thus, we start by extracting line segments using hough transform and merging them into long straight lines as shown in Figure 3(d). A hypothesis can be any combination of these

3 (a) Example of a Ground-Wall Boundary Hypothesis (b) Ground Plane (a) 3D Coordinates of Ground-Plane Points (c) Wall Plane (d) Line Extraction (e) Occluded Edges (f) Identifying Vertical Lines (b) 3D Coordinates of Wall-Plane Points Fig. 4. (a) The 3D coordinate of a ground plane point can be determined by the intersection of the camera ray and the ground plane with normal vector n g and known camera height h.(b) A wall plane equation is determined by its corresponding ground-wall boundary line segment. The intersection of the camera ray and the wall equation determines the 3D coordinate of a wall plane point. (g) Lines Below the Camera Center (h) Cluster the Lines into Three Groups Fig. 3. Hypothesis generation. (a) A ground-wall boundary hypothesis is defined by a polyline extending from the left to the right boundaries of the image. (b) The ground plane is defined by the enclosed area of the polyline. (c) A line segment in the hypothesis defines a wall in 3D. (d) Line extraction using hough transform. (e) Vertical lines (pink lines) in the polyline implies occluded edges. (f) Vertical lines are identified based on their slopes. (g) The green point is the camera center and the candidate lines are the lines that are below the camera center. (h) The candidate lines are clustered into three groups (i.e. left, right, front) based on their slopes. lines. To simplify the problem, we model the indoor environment using only three walls (i.e. left, right, and front walls) and the ground plane with no occluded edges. Therefore hypotheses with vertical lines are not considered since vertical lines imply occluded edges (Figure 3(e)). Vertical lines are identified based on their slopes as shown in Figure 3(f). Since the camera is always placed above the ground plane, all the lines in a feasible hypothesis are below the camera center (Figure 3(g)). Thus, lines that are above the camera center are removed. We, then, cluster the lines into three groups, (i.e., left, right, front) based on their slopes in the image as shown in Figure 3(h). A hypothesis can be formed by selecting one line from each group and connecting them. We define the edge support of a hypothesis to be the fraction of the length of the boundary that consists of edge pixels which can be obtained by any edge detector. Hypotheses with edge support below a threshold are removed. C. 3D Coordinates of Ground-Plane and Wall-Plane Points In the camera space, the 3D location P i = (x i, y i, z i ) T of an image point p i = (u i, v i, 1) T is related by P i = z i p i for some z i. If the point P i lies on the ground plane with normal vector n g, the exact 3D location can be determined by the intersection of the line and the ground plane (Figure 4(a)): d g = n g P i = z i n g p i (1) where d g is the distance of the optical center of the camera to the ground. By setting the normal vector of the ground plane to n g = (0, 1, 0) T and the distance d g as the camera height, h, 3D location of any point on the ground plane with image location p i can be determined by x i u i /v i P i = y i = h z i 1 1/v i. (2) The camera height h is given by the user or can be set to h = 1 without loss of generality. A wall plane equation is determined by its corresponding ground-wall boundary line segment in a given ground-wall boundary hypothesis as shown in Figure 4(b). Points on the ground plane, we determine the normal vector of a wall plane n w based on the ground-wall boundary line segment, n w = n g v b (3) where v b is the 3D direction vector of the boundary line segment. If a point lies on the wall plane with position p j in the image space, its 3D location can be determined by d w = λ j n w p j (4) where d w can be obtained by any point on the boundary line.

4 D. Camera Motion Estimation To estimate the camera motion, we detect a set of point features and track them across frames. This feature tracking can be obtained by many methods, such as SIFT[11], SURF[1] and KLT[13] feature tracking or matching. In this project, we use SURF feature matching because it is more efficient than SIFT feature matching and it is more accurate than KLT features for large baselines. Since the 3D location of the feature points are static in the world coordinate, camera motion between two frames can be estimated by observing 2D tracked points. This camera motion estimation is equivalent to estimating the rigid body transformation of the 3D point correspondences from a still camera under the camera coordinate system. In this project, we assume that the camera is moving parallel to the ground plane so the estimated rigid body transformation of the points contains three degrees of freedom, ( x, z, θ), where x and z are the translation and θ is the rotation around y- axis. Based on the point correspondences in the image space, we reconstruct the 3D locations of the point features for both frames individually under the ground-wall boundary hypothesis selected by the user. Given the two corresponding 3D point sets in the camera space, {P i = (X P i, YP i, ZP i )T } and {Q i = (X Q i, YQ i, ZQ i )T }, i = 1...N, in two frames, the rigid-body transformation is related by Q i = RP i + T where R is the rotation matrix, T is the 3D translation vector. The rotation matrix R has only one degree of freedom which has the form cos( θ) 0 sin( θ) R = (5) sin( θ) 0 cos( θ) and the translation vector T = (t x, 0, t z ) T has two degrees of freedom. In order to estimate the three degrees of transformation, the point correspondences are projected onto the ground plane which are denoted as {P i = (X P i, h, ZP i )T } and {Q i = (X Q i, h, ZQ i )T }. The rotation matrix can be estimated by the angular difference between two corresponding vectors, P ip j and Q iq j. Our estimated θ is thus the weighted average of the angular differences, cos( θ) = 1 ωij where ω ij is defined as i j ω ij P ip j Q iq j P ip j Q iq j ω ij = (1/ZP i + 1/Z Q i ) (1/Z P j + 1/ZQ j ) (7) 2 2 where intuitively distant 3D points are given lower weights since the reconstructed 3D positions of distant points are less accurate. The translation vector T is then estimated by the weighted average of the differences between RP i and Q i, T = 1 ωi N i=1 (6) ω i (RP i Q i) (8) (a) (b) Fig. 5. Examples of hypotheses generation using MATLAB and images taken by an regular hand-held camera. where ω i = (1/Z P i + 1/Z Q i )/2. IV. EXPERIMENTS AND RESULTS We first tested our idea by simulating the algorithm in MATLAB. Once we validated the simulation results, we then implemented the algorithm on the android platform by using the Eclipse and OpenCV 2.2. The client used was a Motorola Droid with a 5 Megapixel camera and an Arm Cortex A8 processor of speed 550 MHz. We used a 2.4GHz Intel Core 2 Duo Macbook Pro as the server. For evaluating the quality of our method, we performed some experiments to compare the performance of the android system against a MATLAB implementation. For the MATLAB implementation, we captured images using a high quality hand held camera. Further we also tested the quality of feature matching, and the run time of our method. Finally, we estimated the accuracy of our technique for computing 3D positions as well as for estimating motion. Figure 8 shows some snapshots of our algorithm on the droid phone. A. Generating Wall-Boundary Hypotheses Figure 5 shows the results from the MATLAB simulation of the ground-wall hypothesis generation step. Figure 6 shows the results of the same process from our android implementation. It can be seen that in both cases, the number of hypotheses

5 Fig. 6. Examples of hypotheses generation using Android. generated depends on factors such as features available in the environment, illumination conditions, etc. However, it must be noted that despite the fact that the droid phone does not have a very high quality camera, our algorithm is still able to identify a large set of plausible hypotheses. If the images taken from the android phone are blurred, the algorithm does not perform as well as the MATLAB simulation which uses the hand held camera. In some of such cases, the true groundwall boundaries are not extracted by the edge detector or the line detector. B. Feature Matching For feature extraction and matching, we used SURF matching provided by OpenCV 2.2. We considered several feature descriptors such as SIFT, Fern, KLT, but found that the SURF features are not only efficient to compute and match, but also are reliable over various indoor environments. We found that the matching process worked very well even when there was large motion between the two images. However, it should be noted that the matched features highly depend on the motion between the images. If the motion is large, it becomes less likely that the two images share the same feature points. The results of SURF matching on a variety of environments are shown in Figure 7 C. Timing Analysis Table I shows the split up of time taken for the various stages in the single Image mode. The hypothesis generation step is of variable timing since the number of hypotheses generated is not constant but depends on the environment. The rest of the stages take roughly the same time over each environment. Table II shows the split up of time taken for the various stages in the two Image mode. Here, the time taken is roughly a second more than the single image mode because an additional image needs to be sent to the server. Apart from TABLE I SPLIT UP OF TIME FOR THE SINGLE IMAGE MODE Split up-single Image Mode Process Time Taken Sending and Receiving Data sec Hypothesis Generation (Variable) 1 sec (average) Computing 3D 5 msec Total Time sec that, the rest of the processing stages take the same amount of time compared to the single image mode. TABLE II SPLIT UP OF TIME FOR THE TWO IMAGE MODE Split up-single Image Mode Process Time Taken Sending and Receiving Data sec Hypothesis Generation (Variable) 1 sec (average) Motion Estimation 2 msec Total Time sec The real bottleneck in both modes is the data transfer step. We believe that the speed can further be improved by using some kind of compression strategy to send and receive data. However, in such a case, lossless compression would be preferred because the hypotheses generated depend on the quality of the image. D. Accuracy of the Estimations We find that our method has excellent accuracy given the real world situations in which the phone was tested. The error in the 3D estimation, given the user selected boundary is approximately 0.5 m. In order to compute this, we measured the actual 3D coordinates of a few points in the scene and compared our result with it. It must be taken into account that there would be some discrepancy in the comparison because of human error. The accuracy depends on a number of factors

6 (a) (b) (c) Fig. 7. Examples of SURF Feature Matching. (a) is the matching results from images taken from a regular hand-held camera. (b) and (c) is the matching results using Android phone. such as the tilt involved, the hypothesis selected, etc. The precision tends to reduce in cases where the user selects a bad ground-wall boundary. The error of our motion estimation given the selected boundaries is approximately 0.3m for translation and 5 for rotation.the Rotation estimation is less accurate because we estimated it based on a linear approximation even though the rotation matrix is a non-linear transformation. The translation is more accurate because it is a linear transformation and the optimized solution is the averaged distance between two corresponding 3D point sets as discussed before. V. CONCLUSIONS We presented a novel method to compute the 3D position of any point in a scene up to scale, measure objects within the scene and also localize the user in a scene using an android phone. Our method uses simple geometry and loose assumptions to compute the 3D configuration of the scene. Our current approach depends a lot on image quality. The speed of the process varies according to the number of hypotheses

7 (a) Main Menu (b) Camera Menu (c) Two Image Mode Menu (d) Select Hypotheses (e) Measure 3D Location (Single Image Mode) (f) Motion Estimation (Two Image Mode) Fig. 8. Snap Shot of the Android Phone. The main menu (a) lets the user to choose between the modes, single image mode or two image mode. The single image mode directly lead the user to the camera menu (b). Once the user taken the image and send it to the server, the server sends an selection image back for user to choose the best hypothesis as in (d). The server receive the selection and send details of the selected hypothesis along and begin the 3D location measurements (e). Note that all the 3D measurements are done on the phone. If the user select the two image mode, the two image mode menu (c) will pop up. The user then takes the images and select the best hypothesis using the same process as in the single image mode. Once the images are taken and the motion estimation button is hit, the server estimates the motion and send the details to the phone as in (f). generated. Future work includes developing a mechanism for intelligent hypothesis generation and making the method invariant to environmental changes and image quality. Our current results shows that our method has very high accuracy and speed, and is ideal for the mobile platform. [13] J. Shi and C. Tomasi. Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition, pages , [14] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. In European Conference on Computer Vision, REFERENCES [1] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. In Computer Vision and Image Understanding (CVIU), volume 110, pages , [2] E. Delage, H. Lee, and A. Y. Ng. Automatic single-image 3d reconstructions of indoor Manhattan world scenes [3] E. Delage, H. Lee, and A. Y. Ng. A dynamic Bayesian network model for autonomous 3d reconstruction from a single indoor image. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR, pages , [4] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In European Conference on Computer Vision, [5] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. In International Conference on Computer Vision, [6] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In International Conference on Computer Vision, [7] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. In International Journal of Computer Vision, volume 75, October [8] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR), volume 2, pages , [9] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), [10] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In Conference on Computer Vision and Pattern Recognition, [11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision, [12] A. Saxena, M. Sun, and A. Ng. Make3d: learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 30: , 2009.